Capsule restarts often, probably due to GlobalTenantResources #1263

Closed
sandert-k8s opened this issue Nov 28, 2024 · 4 comments · Fixed by #1264
Labels
bug Something isn't working

Comments

sandert-k8s commented Nov 28, 2024

Bug description

The Capsule controller pod has restarted roughly 300 times in the last 14 days. It appears to fail while reconciling the GlobalTenantResource, even though we haven't changed it. We see this issue in 2 different Kubernetes clusters.

How to reproduce

Not completely sure how to reproduce, but our tenant setup is quite basic (I removed some options that I don't think are relevant to this bug):

apiVersion: capsule.clastix.io/v1beta2
kind: Tenant
metadata:
  annotations: {}
  labels:
    customer: customer1
    kubernetes.io/metadata.name: customer1-dev
  name: customer1-dev
spec:
  owners:
    - clusterRoles:
        - admin
        - capsule-namespace-deleter
      kind: Group
      name: customer1-dev-admin-group
  namespaceOptions:
    additionalMetadata:
      labels:
        customer: customer1
    quota: 25
  preventDeletion: false
  serviceOptions:
    allowedServices:
      externalName: true
      loadBalancer: true
      nodePort: false
    forbiddenAnnotations: {}
    forbiddenLabels: {}
  ingressOptions:
    hostnameCollisionScope: Disabled
  resourceQuotas:
    items:
      - hard:
          secrets: '200'
          services.loadbalancers: '3'
          persistentvolumeclaims: '50'
          resourcequotas: '25'
          openshift.io/imagestreams: '0'
          replicationcontrollers: '0'
          requests.memory: 10Gi
          pods: '200'
          requests.storage: 50Gi
          limits.cpu: '2'
          limits.memory: 10Gi
          configmaps: '200'
          services: '50'
          requests.cpu: '2'
    scope: Tenant
  storageClasses:
    allowed:
      - standard
  imagePullPolicies:
    - Always
  limitRanges: {}
  cordoned: false
  networkPolicies: {}

And the GlobalTenantResource:

apiVersion: capsule.clastix.io/v1beta2
kind: GlobalTenantResource
metadata:
  name: pullsecrets-customer1
spec:
  pruningOnDelete: true
  resources:
    - namespacedItems:
        - apiVersion: v1
          kind: Secret
          namespace: customer1-system
          selector:
            matchLabels:
              imagePullSecret: ourcustomcr.com
  resyncPeriod: 60s
  tenantSelector:
    matchLabels:
      customer: customer1
status:
  processedItems:
    - apiVersion: v1
      kind: Secret
      name: customer1-robot-pull-secret
      namespace: customer1-acc-test1
    - apiVersion: v1
      kind: Secret
      name: customer1-robot-pull-secret
      namespace: customer1-dev-test
  selectedTenants:
    - customer1-acc
    - customer1-dev

(The .status.processedItems list contained many more secrets; I left two in for reference.)

Expected behavior

Capsule should not crash.

Logs

I've attached the Capsule logs from a pod at the moment it crashes, with a few preceding log lines included.

capsule.log

Additional context

  • Capsule version: 0.7.2
  • Helm Chart version: 0.7.2
  • Kubernetes version: OpenShift 4.15.31, based on K8s 1.28
@sandert-k8s sandert-k8s added blocked-needs-validation Issue need triage and validation bug Something isn't working labels Nov 28, 2024
@sandert-k8s sandert-k8s changed the title Capsule restarts often, probably due tot GlobalTenantResources Capsule restarts often, probably due to GlobalTenantResources Nov 28, 2024
@oliverbaehler
Collaborator

@sandert-k8s thanks for the report, I will look into it. What is the pod exit status (is it an OOMKill or just exit 1)?

@prometherion
Member

Nobody expects the concurrent map writes!

fatal error: concurrent map writes

That was the first thing I noticed in the logs; we need to fix the concurrency. Definitely a bug.
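For context, a minimal sketch (hypothetical names, not Capsule's actual code) of why this aborts the whole process: sets.Set from k8s.io/apimachinery is a plain Go map underneath, and the runtime terminates the program with "fatal error: concurrent map writes" as soon as it detects parallel writers.

package main

import (
	"fmt"
	"sync"

	"k8s.io/apimachinery/pkg/util/sets"
)

func main() {
	// processed mimics a shared "already replicated" set (hypothetical name).
	processed := sets.New[string]()

	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			// Each goroutine writes to the same underlying map with no lock.
			processed.Insert(fmt.Sprintf("secret-%d", i))
		}(i)
	}
	wg.Wait()
	fmt.Println("processed items:", processed.Len())
}

Running this will usually crash with the same fatal error seen in capsule.log; the runtime aborts the process rather than raising a recoverable error, which is why the pod restarts instead of just logging a failure.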

@prometherion prometherion removed the blocked-needs-validation Issue need triage and validation label Nov 28, 2024
@sandert-k8s
Author

sandert-k8s commented Nov 28, 2024

Thanks for the fast replies!
@oliverbaehler

  containerStatuses:
    - restartCount: 334
      started: true
      ready: true
      name: manager
      state:
        running:
          startedAt: '2024-11-28T14:45:10Z'
      imageID: 'ghcr.io/github-ghcr/projectcapsule/capsule@sha256:f0d8f3d724f97179ca2fda1d75914f9373d3b1103ebd987de765fc2f1fd78377'
      image: 'ghcr.io/github-ghcr/projectcapsule/capsule:v0.7.2'
      lastState:
        terminated:
          exitCode: 2
          reason: Error
          startedAt: '2024-11-28T14:15:26Z'
          finishedAt: '2024-11-28T14:45:02Z'

So, exit code 2.
The pod uses far less memory than its resources.requests, so it doesn't look like an OOMKill.

@prometherion
Member

Capsule will use noticeably more memory here because replication relies on caching objects to avoid putting pressure on the API Server; that is by design.

The issue is not related to memory. As the shared logs show, we had concurrent map writes: the sets.Set utility is written to from several goroutines, which run in parallel to keep object replication as fast as possible.

I already opened a PR that fixes this issue, thanks for reporting!
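A minimal sketch of one common fix pattern (an assumption on my part, not necessarily what the linked PR implements): serialize access to the shared set behind a mutex so the replication goroutines can record processed items safely.

package main

import (
	"fmt"
	"sync"

	"k8s.io/apimachinery/pkg/util/sets"
)

// guardedSet is a hypothetical wrapper that protects a sets.Set with a mutex.
type guardedSet struct {
	mu    sync.Mutex
	items sets.Set[string]
}

func newGuardedSet() *guardedSet {
	return &guardedSet{items: sets.New[string]()}
}

// Insert serializes writes so concurrent goroutines never touch the map at once.
func (g *guardedSet) Insert(keys ...string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.items.Insert(keys...)
}

// Len takes the lock as well, since reads racing with writes are also unsafe.
func (g *guardedSet) Len() int {
	g.mu.Lock()
	defer g.mu.Unlock()
	return g.items.Len()
}

func main() {
	processed := newGuardedSet()

	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			processed.Insert(fmt.Sprintf("secret-%d", i)) // safe: writes are serialized
		}(i)
	}
	wg.Wait()
	fmt.Println("processed items:", processed.Len())
}

An alternative under the same assumption is to give each goroutine its own local set and merge the results once the WaitGroup finishes, which avoids lock contention on the hot path.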
