Race condition between reading the ack-role-account-map ConfigMap and resyncing breaks CARM #2011

nampnguyen · 2024-02-09T21:51:41Z

Describe the bug
When a controller starts while using CARM, particularly if the pod has a CPU limit specified (~250m), the account config map can sometimes be read after reconciliation starts. When this happens, the controller no longer assumes the appropriate cross account role and results in AccessDenied errors because the IRSA role does not have permissions to resources in other accounts.

The problem seems to be getting worse over time and may be related to the number of namespaces.

Even after the ack-role-account-map is read, the controller does not use the correct cross account role.

Debug logs (with some redactions):

{"level":"info","ts":"2024-02-09T19:54:29.148Z","logger":"setup","msg":"initializing service controller","aws.service":"dynamodb"}
{"level":"debug","ts":"2024-02-09T19:54:29.148Z","logger":"cache.namespace","msg":"Starting namespace cache","watchScope":[],"ignored":["kube-system","kube-public","kube-node-lease"]}
{"level":"debug","ts":"2024-02-09T19:54:29.158Z","logger":"cache.namespace","msg":"created namespace","name":"namespace-1"}
{"level":"debug","ts":"2024-02-09T19:54:29.158Z","logger":"cache.namespace","msg":"created namespace","name":"namespace-2"}
...
{"level":"debug","ts":"2024-02-09T19:54:29.158Z","logger":"cache.namespace","msg":"created namespace","name":"namespace-39"}
{"level":"debug","ts":"2024-02-09T19:54:29.158Z","logger":"cache.namespace","msg":"created namespace","name":"namespace-40"}
{"level":"debug","ts":"2024-02-09T19:54:29.170Z","logger":"ackrt","msg":"Initiating reconciler","reconciler kind":"Backup","resync period seconds":36000}
{"level":"debug","ts":"2024-02-09T19:54:29.170Z","logger":"ackrt","msg":"Initiating reconciler","reconciler kind":"GlobalTable","resync period seconds":36000}
{"level":"debug","ts":"2024-02-09T19:54:29.170Z","logger":"ackrt","msg":"Initiating reconciler","reconciler kind":"Table","resync period seconds":36000}
{"level":"info","ts":"2024-02-09T19:54:29.171Z","logger":"setup","msg":"starting manager","aws.service":"dynamodb"}
{"level":"info","ts":"2024-02-09T19:54:29.171Z","logger":"controller-runtime.metrics","msg":"Starting metrics server"}
...
{"level":"info","ts":"2024-02-09T19:54:29.171Z","msg":"Starting Controller","controller":"table","controllerGroup":"dynamodb.services.k8s.aws","controllerKind":"Table"}
{"level":"info","ts":"2024-02-09T19:54:29.273Z","msg":"Starting workers","controller":"adoptedresource","controllerGroup":"services.k8s.aws","controllerKind":"AdoptedResource","worker count":1}
{"level":"info","ts":"2024-02-09T19:54:29.274Z","msg":"Starting workers","controller":"table","controllerGroup":"dynamodb.services.k8s.aws","controllerKind":"Table","worker count":1}
{"level":"info","ts":"2024-02-09T19:54:29.274Z","msg":"Starting workers","controller":"globaltable","controllerGroup":"dynamodb.services.k8s.aws","controllerKind":"GlobalTable","worker count":1}
{"level":"info","ts":"2024-02-09T19:54:29.274Z","msg":"Starting workers","controller":"globaltable","controllerGroup":"dynamodb.services.k8s.aws","controllerKind":"GlobalTable","worker count":1}
{"level":"info","ts":"2024-02-09T19:54:29.274Z","msg":"Starting workers","controller":"fieldexport","controllerGroup":"services.k8s.aws","controllerKind":"FieldExport","worker count":1}
{"level":"info","ts":"2024-02-09T19:54:29.274Z","msg":"Starting workers","controller":"table","controllerGroup":"dynamodb.services.k8s.aws","controllerKind":"Table","worker count":1}
{"level":"info","ts":"2024-02-09T19:54:29.274Z","msg":"Starting workers","controller":"backup","controllerGroup":"dynamodb.services.k8s.aws","controllerKind":"Backup","worker count":1}
{"level":"info","ts":"2024-02-09T19:54:29.274Z","msg":"Starting workers","controller":"backup","controllerGroup":"dynamodb.services.k8s.aws","controllerKind":"Backup","worker count":1}
{"level":"debug","ts":"2024-02-09T19:54:29.277Z","logger":"exporter.field-export-reconciler","msg":"error did not need requeue","error":"the source resource is not synced yet"}
{"level":"debug","ts":"2024-02-09T19:54:29.279Z","logger":"ackrt","msg":"> r.Sync","account":"222222222222","role":"","region":"us-east-1","kind":"Table","namespace":"namespace-5","name":"myapp-dynamodb","generation":3}
{"level":"debug","ts":"2024-02-09T19:54:29.279Z","logger":"ackrt","msg":">> r.resetConditions","account":"222222222222","role":"","region":"us-east-1","kind":"Table","namespace":"namespace-5","name":"myapp-dynamodb","generation":3}
{"level":"debug","ts":"2024-02-09T19:54:29.279Z","logger":"ackrt","msg":"<< r.resetConditions","account":"222222222222","role":"","region":"us-east-1","kind":"Table","namespace":"namespace-5","name":"myapp-dynamodb","generation":3}
{"level":"debug","ts":"2024-02-09T19:54:29.279Z","logger":"ackrt","msg":">> rm.ResolveReferences","account":"222222222222","role":"","region":"us-east-1","kind":"Table","namespace":"namespace-5","name":"myapp-dynamodb","is_adopted":false,"generation":3}
{"level":"debug","ts":"2024-02-09T19:54:29.279Z","logger":"ackrt","msg":"<< rm.ResolveReferences","account":"222222222222","role":"","region":"us-east-1","kind":"Table","namespace":"namespace-5","name":"myapp-dynamodb","is_adopted":false,"generation":3}
{"level":"debug","ts":"2024-02-09T19:54:29.279Z","logger":"ackrt","msg":">> rm.EnsureTags","account":"222222222222","role":"","region":"us-east-1","kind":"Table","namespace":"namespace-5","name":"myapp-dynamodb","is_adopted":false,"generation":3}
{"level":"debug","ts":"2024-02-09T19:54:29.279Z","logger":"ackrt","msg":"<< rm.EnsureTags","account":"222222222222","role":"","region":"us-east-1","kind":"Table","namespace":"namespace-5","name":"myapp-dynamodb","is_adopted":false,"generation":3}
{"level":"debug","ts":"2024-02-09T19:54:29.279Z","logger":"ackrt","msg":">> rm.ReadOne","account":"222222222222","role":"","region":"us-east-1","kind":"Table","namespace":"namespace-5","name":"myapp-dynamodb","is_adopted":false,"generation":3}
{"level":"debug","ts":"2024-02-09T19:54:29.279Z","logger":"ackrt","msg":">>> rm.sdkFind","account":"222222222222","role":"","region":"us-east-1","kind":"Table","namespace":"namespace-5","name":"myapp-dynamodb","is_adopted":false,"generation":3}
{"level":"debug","ts":"2024-02-09T19:54:29.311Z","logger":"cache.account","msg":"created account config map","name":"ack-role-account-map"}
{"level":"debug","ts":"2024-02-09T19:54:29.349Z","logger":"ackrt","msg":"<<< rm.sdkFind","account":"222222222222","role":"","region":"us-east-1","kind":"Table","namespace":"namespace-5","name":"myapp-dynamodb","is_adopted":false,"generation":3,"error":"AccessDeniedException: User: arn:aws:sts::111111111111:assumed-role/IRSARole/1707508469279464158 is not authorized to perform: dynamodb:DescribeTable on resource: arn:aws:dynamodb:us-east-1:111111111111:table/myapp because no identity-based policy allows the dynamodb:DescribeTable action\n\tstatus code: 400, request id: MH8E64L3D7BR9OJRI11BQ75FBNVV4KQNSO5AEMVJF66Q9ASUAAJG"}
{"level":"debug","ts":"2024-02-09T19:54:29.349Z","logger":"ackrt","msg":"<< rm.ReadOne","account":"222222222222","role":"","region":"us-east-1","kind":"Table","namespace":"namespace-5","name":"myapp-dynamodb","is_adopted":false,"generation":3,"error":"AccessDeniedException: User: arn:aws:sts::111111111111:assumed-role/IRSARole/1707508469279464158 is not authorized to perform: dynamodb:DescribeTable on resource: arn:aws:dynamodb:us-east-1:111111111111:table/myapp because no identity-based policy allows the dynamodb:DescribeTable action\n\tstatus code: 400, request id: MH8E64L3D7BR9OJRI11BQ75FBNVV4KQNSO5AEMVJF66Q9ASUAAJG"}

Steps to reproduce

Configure CARM with a large number of namespaces (~30-40) and a CPU limit on the controller pod
Create an ACK resource
Restart the controller pod to trigger reloading of the config map and reconciliation

Expected outcome
Resource resyncing to wait until after the ack-role-account-map ConfigMap is read or future resyncs pivots to the cross account role if the ConfigMap is read after resyncing starts.

Environment

Kubernetes version: 1.27
Using EKS (yes/no), if so version? Yes, 1.27
AWS service targeted (S3, RDS, etc.): Multiple (observed with DynamoDB, IAM, S3, SQS)

The text was updated successfully, but these errors were encountered:

a-hilaly · 2024-02-10T01:51:55Z

@nampnguyen Is this happening only during controller startup, or have you observed it persisting even after multiple reconciliations?

We can change a few things in the ACK runtime to mitigate this:

On startup, we could wait the for CARM ShareInformers to sync, before spinning the reconcilers.
If a controllers is trying to use a CARM entry that wasn't populated/observed by the CARM sharedinfomers... we need to error and requeue (until it's properly propagated)

Quick note: We really need to start stress and load testing for ACK controllers, especially for components like CARM watchers.. i'm pretty sure there is a lot we can catch in there.

nampnguyen · 2024-02-10T02:58:09Z

@a-hilaly In our experience, if the reconciler starts before the ack-role-account-map ConfigMap is read, the issue persists for the life of the controller. The only option is to restart the controller and roll the dice again.

When we originally saw this issue when we had a CPU limit of 250m, but then increased it to 2000m. After the limit increase, the frequency of reconciler errors went down significantly. We implemented a cronjob to check for reconciler errors and restarted the controller if that happened which took care of the few times it popped back up. However, since that time the number of namespaces and accounts used by ACK has grown and we now see some controllers continue to experience reconciler errors even after dozens of restarts.

…tated namespaces (#138) Addresses aws-controllers-k8s/community#2011 In certain scenarios, where a user deploys a resource to a namespace annotated with a specific ownner accountID, a race condition was identified between the reconciler and the CARM (Cross Account Resource Management) `ConfigMap`. This race condition resulted in the controller setting an empty roleARN, preventing the aws-sdk-go client from pivoting (calling `STS::AssumeRole`) and managing resourecs in the correct account. Instead, resources were inadvertently managed in the default account instead of the namespace assigned account. This issue stemmed from the initial implementation of the CARM feature, where the method responsible for retrieving the accountID from the cache, didn't not properly verify the existance and content of the CARM configMap and instead returned an empty stringy when these conditions were not satisfied. This led to selection of the default account (when an empty `RoleARN` is returned )for resource management. Although these scenarios are rare, they can occur in clusters with a significantly high number of namespaces, causing a delay between naemsapce/configmap events and the informer's event handlers. This patch addresses the race issue by implementing two main things: - Proper error propagation: an error is no propagated when a `ConfigMap` is missing or when an accountID entry is missing in the `ConfigMap`. This helps the reconciler make the right decision on how to handle these cases. - Improved error handling: The reconciler now carefully handles these errors and requeues whenever a user has issued an owneraccountid-annotated namespace but the Configmap is not create or properly propagated. Signed-off-by: Amine Hilaly <hilalyamine@gmail.com> By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

a-hilaly · 2024-02-22T01:03:14Z

@nampnguyen we starting rolling out a fix for all the controllers - i'll ping here once dynamodb controller is patched.

Pin ACK runtime to v0.31.0. Mainly addressing two github issues: - Healthz, liveness and readiness probes aws-controllers-k8s/community#2012 - CARM Race condition and scaling issues aws-controllers-k8s/community#2011 By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

nampnguyen · 2024-03-04T16:35:40Z

@a-hilaly Saw the controllers PRs that included the 0.31.0 runtime get closed, is the fix being re-worked? Thanks!

a-hilaly · 2024-03-04T17:01:29Z

@nampnguyen Looks like we also needed aws-controllers-k8s/runtime@861f7ed - we're rolling a new patch today :)

nampnguyen · 2024-03-11T15:18:04Z

@a-hilaly We tried the latest controllers using runtime 0.32.0, however are still seeing the issue. Debug logs from iam-controller 1.3.5:

{"level":"info","ts":"2024-03-11T13:47:53.770Z","logger":"setup","msg":"initializing service controller","aws.service":"iam"}
{"level":"debug","ts":"2024-03-11T13:47:53.770Z","logger":"cache.account","msg":"Starting shared informer for accounts cache","targetConfigMap":"ack-role-account-map"}
{"level":"debug","ts":"2024-03-11T13:47:53.770Z","logger":"cache.namespace","msg":"Starting namespace cache","watchScope":[],"ignored":["kube-system","kube-public","kube-node-lease"]}
{"level":"debug","ts":"2024-03-11T13:47:53.788Z","logger":"cache.namespace","msg":"created namespace","name":"namespace-1"}
{"level":"debug","ts":"2024-03-11T13:47:53.788Z","logger":"cache.namespace","msg":"created namespace","name":"namespace-13"}
{"level":"debug","ts":"2024-03-11T13:47:53.816Z","logger":"ackrt","msg":"Initiating reconciler","reconciler kind":"OpenIDConnectProvider","resync period seconds":36000}
{"level":"debug","ts":"2024-03-11T13:47:53.816Z","logger":"ackrt","msg":"Initiating reconciler","reconciler kind":"Policy","resync period seconds":36000}
{"level":"debug","ts":"2024-03-11T13:47:53.816Z","logger":"ackrt","msg":"Initiating reconciler","reconciler kind":"Role","resync period seconds":36000}
{"level":"debug","ts":"2024-03-11T13:47:53.817Z","logger":"ackrt","msg":"Initiating reconciler","reconciler kind":"User","resync period seconds":36000}
{"level":"debug","ts":"2024-03-11T13:47:53.817Z","logger":"ackrt","msg":"Initiating reconciler","reconciler kind":"Group","resync period seconds":36000}
{"level":"debug","ts":"2024-03-11T13:47:53.817Z","logger":"ackrt","msg":"Initiating reconciler","reconciler kind":"InstanceProfile","resync period seconds":36000}
{"level":"info","ts":"2024-03-11T13:47:53.817Z","logger":"setup","msg":"starting manager","aws.service":"iam"}
...
{"level":"info","ts":"2024-03-11T13:47:53.928Z","msg":"Starting workers","controller":"policy","controllerGroup":"iam.services.k8s.aws","controllerKind":"Policy","worker count":1}
{"level":"debug","ts":"2024-03-11T13:47:53.936Z","logger":"exporter.field-export-reconciler","msg":"error did not need requeue","error":"the source resource is not synced yet"}
{"level":"debug","ts":"2024-03-11T13:47:53.938Z","logger":"ackrt","msg":"> r.Sync","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","generation":4}
{"level":"debug","ts":"2024-03-11T13:47:53.938Z","logger":"ackrt","msg":">> r.resetConditions","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","generation":4}
{"level":"debug","ts":"2024-03-11T13:47:53.938Z","logger":"ackrt","msg":"<< r.resetConditions","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","generation":4}
{"level":"debug","ts":"2024-03-11T13:47:53.938Z","logger":"ackrt","msg":">> rm.ResolveReferences","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:53.938Z","logger":"ackrt","msg":"<< rm.ResolveReferences","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:53.938Z","logger":"ackrt","msg":">> rm.EnsureTags","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:53.938Z","logger":"ackrt","msg":"<< rm.EnsureTags","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:53.938Z","logger":"ackrt","msg":">> rm.ReadOne","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:53.938Z","logger":"ackrt","msg":">>> rm.sdkFind","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:53.944Z","logger":"exporter.field-export-reconciler","msg":"error did not need requeue","error":"the source resource is not synced yet"}
// Config map read here
{"level":"debug","ts":"2024-03-11T13:47:53.972Z","logger":"cache.account","msg":"created account config map","name":"ack-role-account-map"}
// Role is empty
{"level":"debug","ts":"2024-03-11T13:47:54.016Z","logger":"ackrt","msg":"<<< rm.sdkFind","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4,"error":"AccessDenied: User: arn:aws:sts::123456789012:assumed-role/ACKControlerIRSARole/1710164873938759450 is not authorized to perform: iam:GetRole on resource: role ACKManagedRole1 because no identity-based policy allows the iam:GetRole action\n\tstatus code: 403, request id: 899dd5aa-2b83-49b1-87cf-818457adce10"}
{"level":"debug","ts":"2024-03-11T13:47:54.016Z","logger":"ackrt","msg":"<< rm.ReadOne","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4,"error":"AccessDenied: User: arn:aws:sts::123456789012:assumed-role/ACKControlerIRSARole/1710164873938759450 is not authorized to perform: iam:GetRole on resource: role ACKManagedRole1 because no identity-based policy allows the iam:GetRole action\n\tstatus code: 403, request id: 899dd5aa-2b83-49b1-87cf-818457adce10"}
{"level":"debug","ts":"2024-03-11T13:47:54.016Z","logger":"ackrt","msg":">> r.ensureConditions","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.016Z","logger":"ackrt","msg":">>> rm.IsSynced","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.016Z","logger":"ackrt","msg":"<<< rm.IsSynced","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.016Z","logger":"ackrt","msg":"<< r.ensureConditions","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.016Z","logger":"ackrt","msg":"< r.Sync","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4,"error":"AccessDenied: User: arn:aws:sts::123456789012:assumed-role/ACKControlerIRSARole/1710164873938759450 is not authorized to perform: iam:GetRole on resource: role ACKManagedRole1 because no identity-based policy allows the iam:GetRole action\n\tstatus code: 403, request id: 899dd5aa-2b83-49b1-87cf-818457adce10"}
{"level":"debug","ts":"2024-03-11T13:47:54.016Z","logger":"ackrt","msg":"> r.patchResourceStatus","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.016Z","logger":"ackrt","msg":">> kc.Patch (status)","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.038Z","logger":"ackrt","msg":"patched resource status","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4,"json":"{\"metadata\":{\"resourceVersion\":\"462345426\"},\"spec\":{...},\"status\":{\"conditions\":[{\"message\":\"AccessDenied: User: arn:aws:sts::123456789012:assumed-role/ACKControlerIRSARole/1710164873938759450 is not authorized to perform: iam:GetRole on resource: role ACKManagedRole1 because no identity-based policy allows the iam:GetRole action\\n\\tstatus code: 403, request id: 899dd5aa-2b83-49b1-87cf-818457adce10\",\"status\":\"True\",\"type\":\"ACK.Recoverable\"},{\"lastTransitionTime\":\"2024-03-11T13:47:54Z\",\"message\":\"Unable to determine if desired resource state matches latest observed state\",\"reason\":\"AccessDenied: User: arn:aws:sts::123456789012:assumed-role/ACKControlerIRSARole/1710164873938759450 is not authorized to perform: iam:GetRole on resource: role ACKManagedRole1 because no identity-based policy allows the iam:GetRole action\\n\\tstatus code: 403, request id: 899dd5aa-2b83-49b1-87cf-818457adce10\",\"status\":\"Unknown\",\"type\":\"ACK.ResourceSynced\"}]}}"}
{"level":"debug","ts":"2024-03-11T13:47:54.038Z","logger":"ackrt","msg":"<< kc.Patch (status)","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.038Z","logger":"ackrt","msg":"< r.patchResourceStatus","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"error","ts":"2024-03-11T13:47:54.038Z","msg":"Reconciler error","controller":"role","controllerGroup":"iam.services.k8s.aws","controllerKind":"Role","Role":{"name":"iam-role-1","namespace":"namespace-1"},"namespace":"namespace-1","name":"iam-role-1","reconcileID":"9d3c3474-c71e-4a60-a544-efe1bfe759c9","error":"AccessDenied: User: arn:aws:sts::123456789012:assumed-role/ACKControlerIRSARole/1710164873938759450 is not authorized to perform: iam:GetRole on resource: role ACKManagedRole1 because no identity-based policy allows the iam:GetRole action\n\tstatus code: 403, request id: 899dd5aa-2b83-49b1-87cf-818457adce10","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:227"}
{"level":"debug","ts":"2024-03-11T13:47:54.044Z","logger":"exporter.field-export-reconciler","msg":"error did not need requeue","error":"the source resource is not synced yet"}
// Role is empty
{"level":"debug","ts":"2024-03-11T13:47:54.044Z","logger":"ackrt","msg":"> r.Sync","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.044Z","logger":"ackrt","msg":">> r.resetConditions","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.044Z","logger":"ackrt","msg":"<< r.resetConditions","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.044Z","logger":"ackrt","msg":">> rm.ResolveReferences","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.044Z","logger":"ackrt","msg":"<< rm.ResolveReferences","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.044Z","logger":"ackrt","msg":">> rm.EnsureTags","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.044Z","logger":"ackrt","msg":"<< rm.EnsureTags","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.044Z","logger":"ackrt","msg":">> rm.ReadOne","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.044Z","logger":"ackrt","msg":">>> rm.sdkFind","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.074Z","logger":"ackrt","msg":"<<< rm.sdkFind","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4,"error":"AccessDenied: User: arn:aws:sts::123456789012:assumed-role/ACKControlerIRSARole/1710164873938759450 is not authorized to perform: iam:GetRole on resource: role ACKManagedRole2 because no identity-based policy allows the iam:GetRole action\n\tstatus code: 403, request id: ad1e102f-2db9-488a-9dbe-6b1894a15626"}
{"level":"debug","ts":"2024-03-11T13:47:54.074Z","logger":"ackrt","msg":"<< rm.ReadOne","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4,"error":"AccessDenied: User: arn:aws:sts::123456789012:assumed-role/ACKControlerIRSARole/1710164873938759450 is not authorized to perform: iam:GetRole on resource: role ACKManagedRole2 because no identity-based policy allows the iam:GetRole action\n\tstatus code: 403, request id: ad1e102f-2db9-488a-9dbe-6b1894a15626"}
{"level":"debug","ts":"2024-03-11T13:47:54.074Z","logger":"ackrt","msg":">> r.ensureConditions","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.074Z","logger":"ackrt","msg":">>> rm.IsSynced","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.074Z","logger":"ackrt","msg":"<<< rm.IsSynced","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.074Z","logger":"ackrt","msg":"<< r.ensureConditions","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.074Z","logger":"ackrt","msg":"< r.Sync","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4,"error":"AccessDenied: User: arn:aws:sts::123456789012:assumed-role/ACKControlerIRSARole/1710164873938759450 is not authorized to perform: iam:GetRole on resource: role ACKManagedRole2 because no identity-based policy allows the iam:GetRole action\n\tstatus code: 403, request id: ad1e102f-2db9-488a-9dbe-6b1894a15626"}
{"level":"debug","ts":"2024-03-11T13:47:54.074Z","logger":"ackrt","msg":"> r.patchResourceStatus","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.074Z","logger":"ackrt","msg":">> kc.Patch (status)","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.091Z","logger":"ackrt","msg":"patched resource status","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4,"json":"{\"metadata\":{\"resourceVersion\":\"462345427\"},\"spec\":{...},\"status\":{\"conditions\":[{\"message\":\"AccessDenied: User: arn:aws:sts::123456789012:assumed-role/ACKControlerIRSARole/1710164873938759450 is not authorized to perform: iam:GetRole on resource: role ACKManagedRole2 because no identity-based policy allows the iam:GetRole action\\n\\tstatus code: 403, request id: ad1e102f-2db9-488a-9dbe-6b1894a15626\",\"status\":\"True\",\"type\":\"ACK.Recoverable\"},{\"lastTransitionTime\":\"2024-03-11T13:47:54Z\",\"message\":\"Unable to determine if desired resource state matches latest observed state\",\"reason\":\"AccessDenied: User: arn:aws:sts::123456789012:assumed-role/ACKControlerIRSARole/1710164873938759450 is not authorized to perform: iam:GetRole on resource: role ACKManagedRole2 because no identity-based policy allows the iam:GetRole action\\n\\tstatus code: 403, request id: ad1e102f-2db9-488a-9dbe-6b1894a15626\",\"status\":\"Unknown\",\"type\":\"ACK.ResourceSynced\"}]}}"}
{"level":"debug","ts":"2024-03-11T13:47:54.091Z","logger":"ackrt","msg":"<< kc.Patch (status)","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.091Z","logger":"ackrt","msg":"< r.patchResourceStatus","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"error","ts":"2024-03-11T13:47:54.091Z","msg":"Reconciler error","controller":"role","controllerGroup":"iam.services.k8s.aws","controllerKind":"Role","Role":{"name":"iam-role-2","namespace":"namespace-2"},"namespace":"namespace-2","name":"iam-role-2","reconcileID":"58482afd-89b1-45eb-8630-724deacf338c","error":"AccessDenied: User: arn:aws:sts::123456789012:assumed-role/ACKControlerIRSARole/1710164873938759450 is not authorized to perform: iam:GetRole on resource: role ACKManagedRole2 because no identity-based policy allows the iam:GetRole action\n\tstatus code: 403, request id: ad1e102f-2db9-488a-9dbe-6b1894a15626","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:227"}

Even waiting several minutes, I'm not seeing the re-queuing behavior described in the PRs.

a-hilaly · 2024-03-13T06:11:13Z

@nampnguyen I will try to reproduce locally, maybe i'm missing something in here...
qq: does the namespace have an owner-account-id annotation? the only case where controllers will skip the requeue is if the namespaces aren't annotated..

nampnguyen · 2024-03-13T14:09:57Z

@a-hilaly Yes, the namespaces do have the services.k8s.aws/owner-account-id annotation. Looking at the fix, I think the issue we're currently experiencing may have been introduced by aws-controllers-k8s/runtime@24070d9

If I'm reading this correctly, this line here assumes that if the resource's Account ID is the same as the controller's IRSA account ID, then CARM is not used.

https://github.com/aws-controllers-k8s/runtime/blob/20e1c0d714a90e806d885082922fc3cc8d5358c1/pkg/runtime/reconciler.go#L1029

In the most recent logs I shared, the resource's Account ID and Controller account are in fact the same, but we still need the CARM role pivot. I think this can be fixed by always checking for the namespace annotation and if present use CARM.

a-hilaly · 2024-03-25T16:05:09Z

@nampnguyen makes sense~ working on a fix! I think we might also want to enable CARM even in single-namespace watch mode.

If I'm reading this correctly, this line here assumes that if the resource's Account ID is the same as the controller's IRSA account ID, then CARM is not used.

Correct. Not sure, but this might have been the case even before.

…tated namespaces (aws-controllers-k8s#138) Addresses aws-controllers-k8s/community#2011 In certain scenarios, where a user deploys a resource to a namespace annotated with a specific ownner accountID, a race condition was identified between the reconciler and the CARM (Cross Account Resource Management) `ConfigMap`. This race condition resulted in the controller setting an empty roleARN, preventing the aws-sdk-go client from pivoting (calling `STS::AssumeRole`) and managing resourecs in the correct account. Instead, resources were inadvertently managed in the default account instead of the namespace assigned account. This issue stemmed from the initial implementation of the CARM feature, where the method responsible for retrieving the accountID from the cache, didn't not properly verify the existance and content of the CARM configMap and instead returned an empty stringy when these conditions were not satisfied. This led to selection of the default account (when an empty `RoleARN` is returned )for resource management. Although these scenarios are rare, they can occur in clusters with a significantly high number of namespaces, causing a delay between naemsapce/configmap events and the informer's event handlers. This patch addresses the race issue by implementing two main things: - Proper error propagation: an error is no propagated when a `ConfigMap` is missing or when an accountID entry is missing in the `ConfigMap`. This helps the reconciler make the right decision on how to handle these cases. - Improved error handling: The reconciler now carefully handles these errors and requeues whenever a user has issued an owneraccountid-annotated namespace but the Configmap is not create or properly propagated. Signed-off-by: Amine Hilaly <hilalyamine@gmail.com> By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

mattzech · 2024-05-23T18:11:34Z

@a-hilaly @nampnguyen We are also facing this exact issue when using most recent version of iam-controller. Glad to hear a fix is being worked on. Thanks!

mumlawski · 2024-06-12T08:39:26Z

@a-hilaly we do still face this same issue even after your pull request from April. Just wondering if someone is looking at this problem still?

a-hilaly · 2024-06-12T08:49:54Z

@mattzech @mumlawski @nampnguyen Sorry folks I somehow got side tracked from this issue. This is a high priority issue and I i'll ship a fix for it ASAP.

a-hilaly · 2024-08-06T05:11:21Z

@mumlawski @mattzech @nampnguyen we released a new iam-controller version fix the CARM/CM race condition issue

a-hilaly · 2024-08-07T17:09:25Z

/close

ack-prow · 2024-08-07T17:09:28Z

@a-hilaly: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

a-hilaly added area/runtime Issues or PRs as related to controller runtime, common reconciliation logic, etc kind/bug Categorizes issue or PR as related to a bug. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. labels Feb 9, 2024

a-hilaly self-assigned this Feb 9, 2024

a-hilaly mentioned this issue Feb 17, 2024

Resolve race condition between CARM ConfigMap and reconciler for annotated namespaces aws-controllers-k8s/runtime#138

Merged

a-hilaly mentioned this issue Feb 22, 2024

Pin ACK runtime to v0.31.0 aws-controllers-k8s/code-generator#514

Merged

a-hilaly added the area/carm Issues or PRs related to CARM (Cross Account Resource Management) label Mar 13, 2024

itaiatu mentioned this issue Jun 13, 2024

CARM ack-role-account-map ConfigMap updates are not propagated to ACK controllers #2088

Closed

ack-prow bot closed this as completed Aug 7, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Race condition between reading the ack-role-account-map ConfigMap and resyncing breaks CARM #2011

Race condition between reading the ack-role-account-map ConfigMap and resyncing breaks CARM #2011

nampnguyen commented Feb 9, 2024

a-hilaly commented Feb 10, 2024

nampnguyen commented Feb 10, 2024

a-hilaly commented Feb 22, 2024 •

edited

Loading

nampnguyen commented Mar 4, 2024

a-hilaly commented Mar 4, 2024

nampnguyen commented Mar 11, 2024

a-hilaly commented Mar 13, 2024

nampnguyen commented Mar 13, 2024 •

edited

Loading

a-hilaly commented Mar 25, 2024

mattzech commented May 23, 2024

mumlawski commented Jun 12, 2024

a-hilaly commented Jun 12, 2024

a-hilaly commented Aug 6, 2024

a-hilaly commented Aug 7, 2024

ack-prow bot commented Aug 7, 2024

Race condition between reading the ack-role-account-map ConfigMap and resyncing breaks CARM #2011

Race condition between reading the ack-role-account-map ConfigMap and resyncing breaks CARM #2011

Comments

nampnguyen commented Feb 9, 2024

a-hilaly commented Feb 10, 2024

nampnguyen commented Feb 10, 2024

a-hilaly commented Feb 22, 2024 • edited Loading

nampnguyen commented Mar 4, 2024

a-hilaly commented Mar 4, 2024

nampnguyen commented Mar 11, 2024

a-hilaly commented Mar 13, 2024

nampnguyen commented Mar 13, 2024 • edited Loading

a-hilaly commented Mar 25, 2024

mattzech commented May 23, 2024

mumlawski commented Jun 12, 2024

a-hilaly commented Jun 12, 2024

a-hilaly commented Aug 6, 2024

a-hilaly commented Aug 7, 2024

ack-prow bot commented Aug 7, 2024

a-hilaly commented Feb 22, 2024 •

edited

Loading

nampnguyen commented Mar 13, 2024 •

edited

Loading