Kubewarden controller potentially doing too many requests to apiserver #645
Comments
I tried to reproduce the issue on a test cluster. I deployed the latest version of Kubewarden and enabled the default policies. I didn't see any suspicious activity going on. There were regular WATCH and LIST events done by the controller to monitor the status of the watched resources, but nothing that invasive. Aside from that, the audit-scanner feature was generating some noise, but only when the CronJob was started. How many policies do you have deployed, and are they all active? Is there by any chance some policy that is in a pending state, maybe because of a wrong configuration or because it points to a wasm module that cannot be fetched? Is there something making changes to the policy resources or to the PolicyServer resource? This seems like the controller is trying to reconcile a state, something goes wrong, it does a couple of retries and then sleeps for X minutes. Then the cycle starts again 🤔 |
We have 22 active policies at the moment, each on its own policy server, so 22 policy servers |
Before continuing with this I'll check if it can be something from our side too |
To be honest, looking at a 1-minute search with just a single policy server, it does not seem like a lot, but I'm not sure whether there is still room for improvement |
Untitled discover search (1).csv.zip: here are 15 minutes with all policy servers |
Seeing the same behavior with ~50 policies in a single policy-server instance. All requests originate from the
I now have a scenario where I can reproduce the issue: with 11 policies loaded I see 0-10 reqs/sec; when the 12th policy is loaded, requests surge to 100+ per second. When that 12th policy is deleted, requests drop back to 0-10 per second. The downside is that this seems to be caused by multiple possible combinations of policies. In this case, the policy triggering the issue is:
apiVersion: policies.kubewarden.io/v1
kind: ClusterAdmissionPolicy
metadata:
annotations:
io.kubewarden.policy.category: PSP
io.kubewarden.policy.description: Pod Security Policy that controls usage of volumes
io.kubewarden.policy.severity: medium
io.kubewarden.policy.title: cluster-volumes policy in monitor for namespaces
meta.helm.sh/release-name: kubewarden-policies
meta.helm.sh/release-namespace: cattle-kubewarden-system
creationTimestamp: "2024-04-02T07:47:03Z"
finalizers:
- kubewarden
generation: 1
labels:
app.kubernetes.io/component: policy
app.kubernetes.io/instance: kubewarden-policies
name: cluster-volumes-monitor-namespaces
resourceVersion: "23286973"
uid: 3771d03b-59fa-4df8-9c38-611455909336
spec:
backgroundAudit: true
mode: monitor
module: example.com/kubewarden/policies/volumes-psp:v0.1.11
mutating: false
namespaceSelector:
matchExpressions:
- key: kubernetes.io/metadata.name
operator: NotIn
values:
- cattle-kubewarden-system
- cattle-system
- kube-node-lease
- kube-system
- key: kubernetes.io/metadata.name
operator: In
values:
- foo
- bar
policyServer: default
rules:
- apiGroups:
- ""
apiVersions:
- v1
operations:
- CREATE
- UPDATE
resources:
- pods
settings:
allowedTypes:
- configMap
- downwardAPI
- emptyDir
- persistentVolumeClaim
- secret
- projected
timeoutSeconds: 10 |
Thanks, this is definitely one of the issues we will look into during the next month |
Over the course of one hour we see a constant stream of MODIFIED events for these policies. A full audit event for one of the MODIFIED events:
{
"type": "MODIFIED",
"object": {
"kind": "ClusterAdmissionPolicy",
"metadata": {
"uid": "dde481c0-9207-4916-9ede-782dba6c823e",
"annotations": {
"io.kubewarden.policy.title": "cluster-disallow-secret-in-env policy in monitor for namespaces",
"meta.helm.sh/release-name": "kubewarden-policies",
"meta.helm.sh/release-namespace": "cattle-kubewarden-system",
"io.kubewarden.policy.category": "PSP",
"io.kubewarden.policy.description": "Policy that inspects env vars and rejects a request if a secret was found",
"io.kubewarden.policy.severity": "medium"
},
"labels": {
"app.kubernetes.io/component": "policy",
"app.kubernetes.io/instance": "kubewarden-policies",
"app.kubernetes.io/managed-by": "Helm"
},
"name": "cluster-disallow-secret-in-env-monitor-namespaces"
},
"status": {
"mode": "monitor",
"policyStatus": "active",
"conditions": [
{
"status": "False",
"type": "PolicyActive",
"lastTransitionTime": "2024-05-06T11:08:51Z",
"message": "The policy webhook has not been created",
"reason": "PolicyActive"
},
{
"lastTransitionTime": "2024-05-06T11:08:51Z",
"message": "The latest replica set is not uniquely reachable",
"reason": "LatestReplicaSetIsNotUniquelyReachable",
"status": "False",
"type": "PolicyUniquelyReachable"
},
{
"message": "Configuration for this policy is up to date",
"reason": "ConfigurationVersionMatch",
"status": "True",
"type": "PolicyServerConfigurationUpToDate",
"lastTransitionTime": "2024-04-23T18:30:45Z"
}
]
},
"apiVersion": "policies.kubewarden.io/v1"
}
}
It seems something is modifying the admission policies once every 10 seconds. Hope this provides some clues for finding the cause. |
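If the controller rewrites the policy status on every reconcile loop, each write can show up as a MODIFIED event even when nothing meaningful changed. Below is a minimal controller-runtime sketch of one common mitigation: compare the object before and after the status mutation and skip the write when nothing changed. The helper name and the generic client.Object signature are illustrative and are not taken from the actual kubewarden-controller code.

```go
package reconcilehelpers

import (
	"context"

	"k8s.io/apimachinery/pkg/api/equality"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// patchStatusIfChanged applies the caller's status mutation and only talks to
// the API server when the mutation actually changed something, so reconciles
// that recompute an identical status do not generate extra write traffic or
// MODIFIED watch events.
func patchStatusIfChanged(ctx context.Context, c client.Client, obj client.Object, mutateStatus func()) error {
	before := obj.DeepCopyObject().(client.Object)
	mutateStatus()
	if equality.Semantic.DeepEqual(before, obj) {
		return nil // status is identical: skip the write entirely
	}
	return c.Status().Patch(ctx, obj, client.MergeFrom(before))
}
```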
I'm thinking about this information about the policy status updates. My gut feeling tells me that we indeed have an issue in the policy reconciliation, and not only because of the information shared here: when we run the integration tests we can see a lot of errors like this:
I'm not sure if this is the only cause of the excessive requests sent to the API server, but it looks to me like the controller has a bug in the policy reconciliation, because many times when the controller tries to update the status, the resource is not at the latest version. I still do not know why that happens; maybe too many reconciliation requests are re-enqueued? I do not know, I have not spent time on this yet. Just sharing a thought here so as not to lose it. ;) |
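The "resource is not at the latest version" failures described above are the classic optimistic-concurrency conflict (HTTP 409) on status updates. A common pattern, sketched below with client-go's retry helper, is to re-fetch the object and retry the write on conflict instead of failing the reconcile and re-enqueueing it. This is only an illustration of the pattern with a made-up helper name, not the actual kubewarden-controller code.

```go
package reconcilehelpers

import (
	"context"

	"k8s.io/client-go/util/retry"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// updateStatusWithRetry re-fetches the object and re-applies the status
// mutation whenever the API server rejects the update with a 409 Conflict
// ("the object has been modified"), instead of failing the reconcile and
// re-enqueueing it.
func updateStatusWithRetry(ctx context.Context, c client.Client, obj client.Object, mutateStatus func()) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		// Always work on the latest version of the resource.
		if err := c.Get(ctx, client.ObjectKeyFromObject(obj), obj); err != nil {
			return err
		}
		mutateStatus()
		return c.Status().Update(ctx, obj)
	})
}
```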
Okay, let me share some thoughts I had while I was working on this issue during the past day. First of all, sorry for taking so long to give initial feedback; busy days with personal stuff. But let's get back to business... I think the comments in this issue include 3 possible improvements in our controller. They are:
|
Fixed with the 1.13 release |
Is there an existing issue for this?
Current Behavior
I noticed that, compared to other operators, the kubewarden controller seems to send a large amount of requests to the API server, probably to support reconciliation of the policy server. Based on information from my ElasticSearch instance, here are some stats:
- create requests for services, deployments and secrets called policy-server-default, which always fail with 409 because those resources already exist. They are spread evenly, so it seems the kubewarden controller is constantly trying to create these 3 resources. Each of these (services, deployments and secrets) has 5 log entries per second, all with 409 status.
I think there should be some room for improvement by slowing down some of these cycles because, in my opinion, it should be enough to send these requests every 2-3 seconds, which would dramatically decrease the load on the api-server. Again, I don't have deep knowledge about the reason for this cadence of requests, but maybe some of these request categories can be streamlined to remove some of the load on the apiserver (see the create-or-update sketch below).
Thanks
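Regarding the constant 409s on create: a controller built with controller-runtime would typically avoid blind Create calls by using controllerutil.CreateOrUpdate, which first fetches the object and only creates it when it is missing. The sketch below is purely illustrative (the function name, Service name and port are placeholders) and is not a statement about how kubewarden-controller is actually implemented.

```go
package reconcilehelpers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// reconcileService shows the create-or-update pattern: the object is fetched
// first and a Create is only issued when it does not exist yet, so the API
// server never has to answer with 409 "already exists".
func reconcileService(ctx context.Context, c client.Client, name, namespace string) error {
	svc := &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: namespace},
	}
	_, err := controllerutil.CreateOrUpdate(ctx, c, svc, func() error {
		// Set (or refresh) the desired spec; this callback runs for both the
		// create and the update path. The port value is a placeholder.
		svc.Spec.Ports = []corev1.ServicePort{{Name: "https", Port: 8443}}
		return nil
	})
	return err
}
```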
Expected Behavior
I would expect a lower load of requests on the apiserver.
Steps To Reproduce
Analyse the logs from the apiserver collected in ElasticSearch (usually called audit logs).
Environment
Anything else?
Not really. Thanks