Capsule restarts often, probably due to GlobalTenantResources #1263

Closed
sandert-k8s opened this issue Nov 28, 2024 · 4 comments · Fixed by #1264
Labels
bug Something isn't working

Comments

sandert-k8s commented Nov 28, 2024

Bug description

The Capsule controller pod has restarted roughly 300 times in the last 14 days. It appears to fail while reconciling the GlobalTenantResource, even though we haven't changed it. We see this issue in 2 different Kubernetes clusters.

How to reproduce

Not completely sure how to reproduce, but our tenant setup is quite basic (I removed some options that I don't think are relevant to this bug):

apiVersion: capsule.clastix.io/v1beta2
kind: Tenant
metadata:
  annotations: {}
  labels:
    customer: customer1
    kubernetes.io/metadata.name: customer1-dev
  name: customer1-dev
spec:
  owners:
    - clusterRoles:
        - admin
        - capsule-namespace-deleter
      kind: Group
      name: customer1-dev-admin-group
  namespaceOptions:
    additionalMetadata:
      labels:
        customer: customer1
    quota: 25
  preventDeletion: false
  serviceOptions:
    allowedServices:
      externalName: true
      loadBalancer: true
      nodePort: false
    forbiddenAnnotations: {}
    forbiddenLabels: {}
  ingressOptions:
    hostnameCollisionScope: Disabled
  resourceQuotas:
    items:
      - hard:
          secrets: '200'
          services.loadbalancers: '3'
          persistentvolumeclaims: '50'
          resourcequotas: '25'
          openshift.io/imagestreams: '0'
          replicationcontrollers: '0'
          requests.memory: 10Gi
          pods: '200'
          requests.storage: 50Gi
          limits.cpu: '2'
          limits.memory: 10Gi
          configmaps: '200'
          services: '50'
          requests.cpu: '2'
    scope: Tenant
  storageClasses:
    allowed:
      - standard
  imagePullPolicies:
    - Always
  limitRanges: {}
  cordoned: false
  networkPolicies: {}

And the GlobalTenantResource:

apiVersion: capsule.clastix.io/v1beta2
kind: GlobalTenantResource
metadata:
  name: pullsecrets-customer1
spec:
  pruningOnDelete: true
  resources:
    - namespacedItems:
        - apiVersion: v1
          kind: Secret
          namespace: customer1-system
          selector:
            matchLabels:
              imagePullSecret: ourcustomcr.com
  resyncPeriod: 60s
  tenantSelector:
    matchLabels:
      customer: customer1
status:
  processedItems:
    - apiVersion: v1
      kind: Secret
      name: customer1-robot-pull-secret
      namespace: customer1-acc-test1
    - apiVersion: v1
      kind: Secret
      name: customer1-robot-pull-secret
      namespace: customer1-dev-test
  selectedTenants:
    - customer1-acc
    - customer1-dev

(The .status.processedItems list contained many more secrets; I left two in for reference.)

Expected behavior

Capsule should not crash.

Logs

I've attached the Capsule logs from a pod at the moment it crashes, with a few preceding log lines included.

capsule.log

Additional context

  • Capsule version: 0.7.2
  • Helm Chart version: 0.7.2
  • Kubernetes version: OpenShift 4.15.31, based on K8s 1.28
@sandert-k8s sandert-k8s added blocked-needs-validation Issue need triage and validation bug Something isn't working labels Nov 28, 2024
@sandert-k8s sandert-k8s changed the title Capsule restarts often, probably due tot GlobalTenantResources Capsule restarts often, probably due to GlobalTenantResources Nov 28, 2024
@oliverbaehler
Collaborator

@sandert-k8s thanks for the report, I will look into it. What is the pod exit status (is it an OOMKill or just exit 1)?

@prometherion
Member

Nobody expects the concurrent map writes!

fatal error: concurrent map writes

That was the first thing I noticed in the logs; we need to fix the concurrency. Definitely a bug.
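For context, a minimal sketch (hypothetical names, not Capsule's actual code) of why this aborts the whole process: sets.Set from k8s.io/apimachinery is a plain Go map underneath, and the runtime terminates the program with "fatal error: concurrent map writes" as soon as it detects parallel writers.

package main

import (
	"fmt"
	"sync"

	"k8s.io/apimachinery/pkg/util/sets"
)

func main() {
	// processed mimics a shared "already replicated" set (hypothetical name).
	processed := sets.New[string]()

	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			// Each goroutine writes to the same underlying map with no lock.
			processed.Insert(fmt.Sprintf("secret-%d", i))
		}(i)
	}
	wg.Wait()
	fmt.Println("processed items:", processed.Len())
}

Running this will usually crash with the same fatal error seen in capsule.log; the runtime aborts the process rather than raising a recoverable error, which is why the pod restarts instead of just logging a failure.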

@prometherion prometherion removed the blocked-needs-validation Issue need triage and validation label Nov 28, 2024
@sandert-k8s
Author

sandert-k8s commented Nov 28, 2024

Thanks for the fast replies!
@oliverbaehler

  containerStatuses:
    - restartCount: 334
      started: true
      ready: true
      name: manager
      state:
        running:
          startedAt: '2024-11-28T14:45:10Z'
      imageID: 'ghcr.io/github-ghcr/projectcapsule/capsule@sha256:f0d8f3d724f97179ca2fda1d75914f9373d3b1103ebd987de765fc2f1fd78377'
      image: 'ghcr.io/github-ghcr/projectcapsule/capsule:v0.7.2'
      lastState:
        terminated:
          exitCode: 2
          reason: Error
          startedAt: '2024-11-28T14:15:26Z'
          finishedAt: '2024-11-28T14:45:02Z'

So, exit code 2.
The pod uses far less memory than its resources.requests, so it doesn't look like an OOMKill.

@prometherion
Member

Capsule will use noticeably more memory here because replication relies on caching objects to avoid putting pressure on the API Server; that is by design.

The issue is not related to memory. As the shared logs show, we had concurrent map writes: the sets.Set utility is written to from several goroutines, which run in parallel to keep object replication as fast as possible.

I already opened a PR that fixes this issue, thanks for reporting!
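A minimal sketch of one common fix pattern (an assumption on my part, not necessarily what the linked PR implements): serialize access to the shared set behind a mutex so the replication goroutines can record processed items safely.

package main

import (
	"fmt"
	"sync"

	"k8s.io/apimachinery/pkg/util/sets"
)

// guardedSet is a hypothetical wrapper that protects a sets.Set with a mutex.
type guardedSet struct {
	mu    sync.Mutex
	items sets.Set[string]
}

func newGuardedSet() *guardedSet {
	return &guardedSet{items: sets.New[string]()}
}

// Insert serializes writes so concurrent goroutines never touch the map at once.
func (g *guardedSet) Insert(keys ...string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.items.Insert(keys...)
}

// Len takes the lock as well, since reads racing with writes are also unsafe.
func (g *guardedSet) Len() int {
	g.mu.Lock()
	defer g.mu.Unlock()
	return g.items.Len()
}

func main() {
	processed := newGuardedSet()

	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			processed.Insert(fmt.Sprintf("secret-%d", i)) // safe: writes are serialized
		}(i)
	}
	wg.Wait()
	fmt.Println("processed items:", processed.Len())
}

An alternative under the same assumption is to give each goroutine its own local set and merge the results once the WaitGroup finishes, which avoids lock contention on the hot path.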
