Warehouse does not reconcile #2836

Open
wmiller112 opened this issue Oct 24, 2024 · 17 comments

@wmiller112
Contributor

Checklist

  • I've searched the issue queue to verify this is not a duplicate bug report.
  • I've included steps to reproduce the bug.
  • I've pasted the output of kargo version.
  • I've pasted logs, if applicable.

Description

This is a continuation of a previous issue. I continue to experience Warehouses not respecting spec.interval or reconciling when changed. I've observed the same behavior across roughly 20 different projects. It does not seem related to the Git/image repos, because when the reconcile does finally trigger, it completes in a matter of seconds, as can be seen in the screenshot of logs. The reconcile occurs at a fairly regular interval of roughly 15-20 minutes regardless of the configured interval (I've tried values ranging from 2m to 10m). Triggering a refresh via the GUI adds an annotation to the Warehouse, but even this does not trigger a reconcile; it eventually reconciles on the same cadence.

Screenshots

[Screenshot: controller logs, 2024-10-24 8:59:21 AM]

Steps to Reproduce

  • Create a Warehouse with a spec similar to:
spec:
  freightCreationPolicy: Automatic
  interval: 2m0s
  subscriptions:
    - git:
        branch: main
        commitSelectionStrategy: NewestFromBranch
        discoveryLimit: 20
        includePaths:
          - regex:path/to/manifests/.*
        repoURL: https://github.com/<org>/<repo>
    - image:
        allowTags: ^production-[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}-[0-9]{2}-[0-9]{2}Z-[0-9a-f]{40}$
        discoveryLimit: 20
        gitRepoURL: https://github.com/<org>/<repo>
        imageSelectionStrategy: Lexical
        repoURL: xxxxxxxxxx.dkr.ecr.us-west-2.amazonaws.com/<repo>
  • Click refresh on the Warehouse in the GUI
  • Watch it spin until it eventually reconciles

I'd expect that immediately after clicking refresh, the modified Warehouse would be seen by the controller and reconciled, triggering a fetch of new freight.

Version

Client Version: v1.0.3
Server Version: v1.0.3
@krancour
Member

@wmiller112 first, my apologies. I really thought we'd found it and nailed it in #2544 and #2754.

I know you don't suspect rate limiting at this point (nor do I), but I'm still curious if you observe any change in this behavior when you're subscribed to just one repo. In particular, if you subscribe only to the Git repo, does the situation improve?

Also, just to get more data: is there any sharding involved here?

@hiddeco
Contributor

hiddeco commented Oct 24, 2024

It may be an idea for us to bake an image with net/http/pprof enabled so we can collect a profile.
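For reference, enabling this in Go usually just means importing net/http/pprof for its side effects and serving HTTP on a side port. A minimal sketch (the localhost:6060 address is an assumption for illustration, not Kargo's actual wiring):

// Minimal sketch of exposing pprof endpoints. The blank import registers
// the /debug/pprof/* handlers on http.DefaultServeMux; the address below
// is an assumption for illustration only.
package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof/* handlers
)

func main() {
    go func() {
        // Serve the default mux, which now includes the pprof handlers.
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()

    // ... the rest of the program (e.g., the controller manager) runs here.
    select {}
}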

@wmiller112
Contributor Author

wmiller112 commented Oct 24, 2024

It's the same for a warehouse with only a Git subscription.

I've built and am running the controller image with pprof enabled, and I'm able to collect profiles/traces, but I'm not exactly sure what to look for. I also added some additional logging around the requeue interval and confirmed the requeue is being set to the defined interval when the reconcile does run.

Edit: Ah, I see the debug README that was added in the PR says to collect a heap profile.

@hiddeco
Contributor

hiddeco commented Oct 24, 2024

The things typically of interest are the following endpoints (a small collection sketch follows the list):

  • /debug/pprof/goroutine?debug=2 to see what goroutines are currently running, waiting, or blocked.
  • /debug/pprof/profile to see what the CPU is spending most time on.
  • /debug/pprof/heap to get a better insight into memory allocations.
  • /debug/pprof/trace?seconds=<number of seconds> to collect a trace (i.e., almost anything the program is doing over these seconds).
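Once the controller exposes these, a profile can be fetched over plain HTTP and then inspected with go tool pprof. A minimal sketch, assuming pprof is reachable on localhost:6060 (the address and output filename are assumptions):

// Small sketch (not part of Kargo) that downloads a heap profile from a
// pprof-enabled controller. The URL and output filename are assumptions.
package main

import (
    "io"
    "log"
    "net/http"
    "os"
)

func main() {
    resp, err := http.Get("http://localhost:6060/debug/pprof/heap")
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    out, err := os.Create("heap.out")
    if err != nil {
        log.Fatal(err)
    }
    defer out.Close()

    // Save the raw profile; inspect it afterwards with `go tool pprof heap.out`.
    if _, err := io.Copy(out, resp.Body); err != nil {
        log.Fatal(err)
    }
}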

@hiddeco
Contributor

hiddeco commented Oct 24, 2024

Some of these profiles can hypothetically contain slightly sensitive data (e.g., memory addresses or minimal operational details). If you want to share them, you can do so via mail: hidde@akuity.io.

@hiddeco
Contributor

hiddeco commented Oct 24, 2024

My current working theory, based on the traces shared by @wmiller112 in private, is that the concurrency configuration of our reconcilers is currently set to the default (which equals 1). Things are not being picked up in time because the controller is busy gathering the latest image build information for another Warehouse (which can be quite an expensive task depending on the number of images you have).

I have shared a branch with him (https://github.com/hiddeco/kargo/tree/allow-concurrency-config) that allows configuring the number of concurrent reconciliations, to see if this proves me right. If it does, we should create a proper solution for this and think about the default configuration we want to ship with.
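For context, in controller-runtime this maps to the MaxConcurrentReconciles option. A minimal sketch of registering a reconciler with more than one worker (illustrative only, not Kargo's actual setup; the import path for the Warehouse type is assumed):

// Illustrative only: registering a reconciler with multiple workers using
// controller-runtime. The Warehouse import path is assumed for this sketch.
package main

import (
    kargoapi "github.com/akuity/kargo/api/v1alpha1"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/controller"
)

func setupWarehouseReconciler(mgr ctrl.Manager, r ctrl.Reconciler, workers int) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&kargoapi.Warehouse{}).
        WithOptions(controller.Options{
            // With more than one worker, a slow Warehouse reconciliation no
            // longer blocks every other Warehouse from being picked up.
            MaxConcurrentReconciles: workers,
        }).
        Complete(r)
}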

@hiddeco hiddeco self-assigned this Oct 24, 2024
@wmiller112
Contributor Author

With MAX_CONCURRENT_RECONCILES set to 4 I definitely see more timely Warehouse reconciles (down to about 5-10 minutes, with the interval still set to 2m). Interestingly, the execution of each reconcile now takes longer (it was 10-20 seconds; it's now about 1-2 minutes). These logs are for a single warehouse, the same as in the initial post:
[Screenshot: controller logs, 2024-10-24 7:30:14 PM]
It seems there is a point of (completely) diminishing returns: when increasing beyond 4 (I tried 6 and 8), reconciles seem to start on some warehouses and then just hang for 20+ minutes. I have been keeping an eye on resource utilization and am staying well within requests/limits.

@hiddeco
Contributor

hiddeco commented Oct 25, 2024

Would it be possible for you to send me another set of profiles with this configuration enabled?

@hiddeco
Contributor

hiddeco commented Oct 28, 2024

After some more debugging through private channels, the reason for this can be summarized as follows:

  • At present, all reconcilers have a single worker configured. Because of this, anything that takes a long time will delay the processing of all other resources of the same kind.
  • When using (many) "newest build" selectors for container images in Warehouses, there is a high chance they will run into rate limits from the registry at some point. Because multiple Warehouses are likely to be rate-limited at once (the requests come from the same IP, the credentials are shared with other Warehouses, the rate limit is tied to the account/organization, or a combination of these), this causes a flood of HTTP 429 responses. Combined with our fairly simple registry implementation, I suspect these are not detected correctly and retried; in addition, I suspect that goroutines that have already kicked off are not canceled as soon as the first error occurs (see the sketch after this list).
  • We are not configuring GOMAXPROCS and GOMEMLIMIT to match the CPU and memory limits, which is required for the Go runtime to respect the limits of the Pod rather than those of the host.
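To illustrate the registry point above, a minimal, hypothetical sketch (not Kargo's actual registry code) of detecting HTTP 429, backing off, and stopping as soon as the caller's context is canceled:

// Hypothetical illustration only; this is not Kargo's registry code. It shows
// retrying on HTTP 429 with exponential backoff while honoring cancellation
// of the caller's context.
package registry

import (
    "context"
    "net/http"
    "time"
)

func getWithRetry(ctx context.Context, client *http.Client, url string, maxAttempts int) (*http.Response, error) {
    backoff := time.Second
    for attempt := 1; ; attempt++ {
        req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
        if err != nil {
            return nil, err
        }
        resp, err := client.Do(req)
        if err != nil {
            // Includes the case where ctx was canceled mid-request.
            return nil, err
        }
        if resp.StatusCode != http.StatusTooManyRequests || attempt == maxAttempts {
            return resp, nil
        }
        resp.Body.Close()
        select {
        case <-ctx.Done():
            return nil, ctx.Err()
        case <-time.After(backoff):
            backoff *= 2 // simple exponential backoff between attempts
        }
    }
}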

To address this, we should take multiple actions:

  • Allow configuration of the number of workers, preferably per reconciler. The rationale for this is that not every kind is expected to have long-running reconciliations: Warehouses and Promotions can, but Stages should be relatively quick.

  • Ensure the chart properly detects CPU and memory limits and configures the earlier-mentioned environment variables (a small runtime check follows this list). For example:

    env:
    - name: GOMEMLIMIT
      valueFrom:
        resourceFieldRef:
          containerName: controller
          resource: limits.memory
    - name: GOMAXPROCS
      valueFrom:
        resourceFieldRef:
          containerName: controller
          resource: limits.cpu
  • Revise the registry logic, find the gaps that result in a performance impact, and seal them so it won't happen again.
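On the second action item: once those environment variables are set, the effective values can be confirmed from inside the container. A small verification sketch (not Kargo code) that logs what the Go runtime actually picked up:

// Small verification sketch (not Kargo code): prints the GOMAXPROCS value and
// the memory limit the Go runtime is actually using inside the container.
package main

import (
    "log"
    "runtime"
    "runtime/debug"
)

func main() {
    // GOMAXPROCS(0) reports the current setting without changing it.
    log.Printf("GOMAXPROCS: %d", runtime.GOMAXPROCS(0))
    // SetMemoryLimit(-1) reports the current limit without changing it.
    log.Printf("GOMEMLIMIT: %d bytes", debug.SetMemoryLimit(-1))
}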

@ddeath
Contributor

ddeath commented Oct 29, 2024

Hi, I am on 1.0.3 and I am observing similar behavior.

I have 1 Project, 1 Warehouse, and 1 Stage, and it does not discover new Docker images automatically... If I press the refresh button, it will find them, but otherwise it does not do anything.

apiVersion: kargo.akuity.io/v1alpha1
kind: Warehouse
metadata:
  name: testing-playground-warehouse
  namespace: app
spec:
  interval: "1m0s"
  subscriptions:
    - image:
        repoURL: registry.gitlab.com/xxxxxxx/testing-playground-service
        strictSemvers: false   

@ddeath
Contributor

ddeath commented Oct 29, 2024

I deleted the Project, Warehouse, and Stage and recreated them, and now it works fine. I will try to do some things to get it stuck again.

@ddeath
Contributor

ddeath commented Oct 29, 2024

Just a note, I am observing this behavior again. I still have just 1 Project, 1 Stage, 1 Warehouse.

I don't see any errors except:

time="2024-10-29T14:40:32Z" level=error msg="finished streaming call" connect.code=internal connect.duration=17.922126759s connect.method=WatchWarehouses connect.service=akuity.io.kargo.service.v1alpha1.KargoService connect.start_time="2024-10-29T14:40:14Z" error="internal: context canceled"
time="2024-10-29T14:40:32Z" level=error msg="finished streaming call" connect.code=internal connect.duration=17.92214805s connect.method=WatchStages connect.service=akuity.io.kargo.service.v1alpha1.KargoService connect.start_time="2024-10-29T14:40:14Z" error="internal: context canceled"

So if it were a rate limit, I would expect to see some errors in the pod logs. I don't really have reproducible steps yet; the only thing I was doing was changing the Stage manifest (testing different imageSelectionStrategy values).

Let me know if you think this is a completely different issue; I can open a new one.

@krancour
Member

@ddeath you're looking at API server logs when you should be looking at controller logs.

@ddeath
Contributor

ddeath commented Oct 29, 2024

@krancour I checked all pods (kargo-api, kargo-controller, kargo-management-controller, kargo-webhooks-server) and the only errors there were those from kargo-api.

@krancour
Member

@ddeath, you'll notice @hiddeco and @wmiller112 had a lot of back and forth over the course of two issues to figure out what was going on here. @hiddeco has summarized the findings nicely above. You and I also had some recent discussion about what sorts of things can make Warehouse reconciliation slow (and therefore become a bottleneck). Please try to use this information to determine whether you may have one or more poorly-configured Warehouses that are taking a long time to reconcile and blocking others from reconciling at regular intervals. Enabling debug- or even trace-level logging may help as well.

@ddeath
Contributor

ddeath commented Oct 29, 2024

Yeah, I am slowly doing that. However, as I noted, I have only 1 Warehouse with only 1 subscription (with a private token, not a public one, so rate limits should be higher), and I am not seeing any reconcile happening for a long time (30 minutes). So I am just reporting my findings 🤷
But OK, I promise I will not write another comment without some sort of hard evidence of what is wrong in my case...

@krancour
Member

with only 1 subscription

What are its details? Is it a subscription you'd expect to perform poorly based on our other discussion?
