AwsCustomResource: Race condition in IAM policy updates #18237
Comments
I see four potential fixes (there may be others):
If all is well, there should be
{
  "PublishCustomResourcePolicyDF696FCA": {
    "Type": "AWS::IAM::Policy",
    "Properties": {
      "PolicyDocument": {
        "Statement": [ { ... } ],
      },
      "Roles": [
        { "Ref": "AWS679f53fac002430cb0da5b7982bd2287ServiceRoleC1EA0FF2" }
      ]
    }
  },
  "Publish2E9BDF73": {
    "Type": "Custom::SNSPublisher",
    "Properties": {
      "ServiceToken": { ... },
      "Create": { ... },
      "Update": { ... },
    },
    "DependsOn": [
      "PublishCustomResourcePolicyDF696FCA" // <------------ this
    ],
  }
}
This dependency will avoid the race condition. Can you confirm that this
Here are snippets from the code. First from Construct A:
And then from Construct B (the one that sometimes fails):
The synthesized CloudFormation has the proper
Alright, thanks for the thorough report. Seems like built-in retry logic would be the best solution. I'm just a little concerned that it might be hard to classify the right error code from all services, since I'm pretty sure there's no standard for it. We can probably start with
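For illustration, here is a minimal sketch of what retry-on-permission-error could look like. It is not the CDK's implementation: `retryOnAccessDenied`, `isAccessDenied`, the list of retryable codes, and the backoff defaults are all hypothetical names and values chosen for this example. As noted above, services do not share a standard error code, so the code list is an assumption that would need tuning per service.

```typescript
// Hypothetical helper, not part of the CDK. Names and defaults are assumptions.
type SdkLikeError = { name?: string; code?: string };

// Codes treated as "permission has not propagated yet". Assumed list only:
// there is no standard error code shared by all AWS services.
const RETRYABLE_CODES: string[] = ['AccessDenied', 'AccessDeniedException'];

function isAccessDenied(err: unknown): boolean {
  if (typeof err !== 'object' || err === null) {
    return false;
  }
  const e = err as SdkLikeError;
  const candidates = [e.code, e.name];
  // True if either the code or the name matches one of the assumed codes.
  return candidates.some(
    (c) => typeof c === 'string' && RETRYABLE_CODES.indexOf(c) !== -1,
  );
}

async function retryOnAccessDenied<T>(
  call: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 1000,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      // Rethrow anything that is not a permission error, or the last failure.
      if (!isAccessDenied(err) || attempt >= maxAttempts) {
        throw err;
      }
      // Exponential backoff: baseDelayMs, 2x, 4x, ... between attempts.
      const delayMs = baseDelayMs * Math.pow(2, attempt - 1);
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```

A caller would wrap the SDK invocation, for example `await retryOnAccessDenied(() => doSdkCall())`, where `doSdkCall` is a placeholder for whatever the custom resource handler invokes. Errors other than the listed codes are rethrown immediately, so a genuinely wrong policy still fails once the attempts are exhausted rather than being masked.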
I am experiencing this issue with The custom resource succeeds when I turn off rollback and run the deploy multiple times, so I am certain it's a policy propagation issue. In the meantime I broadened my permission policies, but a retry on AccessDenied would be perfect.
I have found this race condition as a consequence of disabling As observed in the analysis above from @lordjabez, the singleton lambda keeps running in the same CFN run and IAM permissions don't appear to propagate on time. I posit that this effect is only apparent because the Lambda still running means the assumed role still uses the previous inline policy version, so even if IAM propagates the policy changes fast enough, Lambda may not spin up a new function that uses them at all. I have had success by forcing a grant on the default identity policy of the underlying Lambda role, because that policy is deployed early on and together with the The grant snippet below, using SSM as an example, works for me as a workaround for now:
// Imports assume aws-cdk-lib v2; existingResource and otherConstruct come from the surrounding code.
import { Grant } from 'aws-cdk-lib/aws-iam';
import { AwsCustomResource, AwsCustomResourcePolicy } from 'aws-cdk-lib/custom-resources';

const resource = existingResource ?? otherConstruct.node.defaultChild as AwsCustomResource;
// Add the permission to the Lambda role's default policy, which is deployed early.
const grant = Grant.addToPrincipal({
  grantee: resource,
  actions: ['ssm:GetParameter'],
  resourceArns: AwsCustomResourcePolicy.ANY_RESOURCE,
});
grant.assertSuccess();
What is the problem?
I have a construct that performs several AWS SDK calls using AwsCustomResource, each of which should have an IAM policy scoped as tightly as possible. Sometimes, later calls fail with a permission error, even though the IAM policy being applied is correct.
Per the documentation: "As this custom resource uses a singleton Lambda function, it's important to note the that function's role will eventually accumulate the permissions/grants from all resources."
What I discovered is that earlier custom resource calls succeed because it takes a little while (~60 seconds) for the Lambda to be created, allowing enough time for the associated IAM policy to propagate. However, subsequent custom resource calls reuse the Lambda, and if it executes too quickly after the policy is edited with the new permission, it won't have said permission and will fail.
Reproduction Steps
I can provide code on request but can't yet post it publicly (working on getting permission to do so). But reproducing it should be straightforward:
If the second call happens fast enough after the policy is applied, it will fail.
What did you expect to happen?
All custom resource calls succeed.
What actually happened?
One of the custom resource calls failed with a permissions error.
CDK CLI Version
2.3.0 (build beaa5b2)
Framework Version
No response
Node.js Version
v16.7.0
OS
MacOS
Language
Typescript
Language Version
4.5.4
Other information
I have a CloudFormation event log I can share that illustrates the problem. Contact me directly and I can provide it.