(custom-resources): StateNotFoundError: State functionActiveV2 not found #24358

Closed
pahud opened this issue Feb 28, 2023 · 17 comments · Fixed by #25228
Labels
@aws-cdk/custom-resources Related to AWS CDK Custom Resources bug This issue is a bug. management/tracking Issues that track a subject or multiple issues p0

Comments

@pahud
Contributor

pahud commented Feb 28, 2023

Please add your +1 👍 to let us know you have encountered this

Status: RESOLVED

Overview:

Any customer using custom resources may encounter the error when the custom resource handler lambda becomes INACTIVE.
Root cause: Lambda installs the SDK at 2.1055.0, but functionActiveV2 doesn't exist until 2.1080.0. It was reported in ap-south-2 but can happen in any region.

The error occurs any time the custom resource provider framework fails to invoke the custom resource handler lambda. In that event, the framework uses functionActiveV2 to wait for the lambda to become active again. However, the call to functionActiveV2 fails in the provider lambda because of the root cause described above.
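To make the version mismatch concrete, here is a minimal sketch of the comparison behind the root cause. The helper names (compareVersions, hasFunctionActiveV2) are hypothetical and exist only for illustration; the two version numbers are the ones quoted in this issue.

```typescript
// Hypothetical helper: compares two dotted version strings numerically.
// Returns -1 if a < b, 0 if equal, 1 if a > b.
function compareVersions(a: string, b: string): number {
  const partsA = a.split(".").map(Number);
  const partsB = b.split(".").map(Number);
  const length = Math.max(partsA.length, partsB.length);
  for (let i = 0; i < length; i++) {
    const diff = (partsA[i] ?? 0) - (partsB[i] ?? 0);
    if (diff !== 0) {
      return diff < 0 ? -1 : 1;
    }
  }
  return 0;
}

// Per this issue, the functionActiveV2 waiter state first appears in SDK 2.1080.0.
const FUNCTION_ACTIVE_V2_SINCE = "2.1080.0";

// Hypothetical check: does the given SDK version include functionActiveV2?
function hasFunctionActiveV2(sdkVersion: string): boolean {
  return compareVersions(sdkVersion, FUNCTION_ACTIVE_V2_SINCE) >= 0;
}

// The SDK version Lambda installs by default, per this issue.
const lambdaDefaultSdk = "2.1055.0";
console.log(hasFunctionActiveV2(lambdaDefaultSdk)); // false
```

The comparison is numeric per segment, so 1055 sorts below 1080 regardless of string ordering.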

Complete Error Message:

Received response status [FAILED] from custom resource. Message returned: StateNotFoundError: State functionActiveV2 not found.

Workaround:

Solution:

Use functionActive instead of functionActiveV2. Fixed in #25228.

Related Issues:

#23862 (comment)

Original Issue

Describe the bug

Deploying 10 or more custom resources in ap-south-2 fails with StateNotFoundError: State functionActiveV2 not found. Two observations:

  1. It's fine to deploy 10+ custom resources in other AWS regions such as us-east-1 or ap-northeast-1. Only ap-south-2 fails in this case.
  2. When deploying with <10 custom resources, ap-south-2 will be fine with no error.

Expected Behavior

The code under Reproduction Steps should deploy successfully in ap-south-2.

Current Behavior

(screenshot; image not available)

Reproduction Steps

import * as cdk from "aws-cdk-lib";
import { Construct } from "constructs";
import * as lambda from "aws-cdk-lib/aws-lambda";
import { CustomResource } from "aws-cdk-lib";
import * as cr from "aws-cdk-lib/custom-resources";

export class DummyLambdaStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const onEventHandler = new lambda.Function(this, 'OnEventHandler', {
      runtime: lambda.Runtime.NODEJS_18_X,
      code: lambda.Code.fromInline(`exports.onEvent = () => { return }`),
      handler: 'index.onEvent',
    })

    const provider = new cr.Provider(this, "Provider", {
      onEventHandler,
    });

    const numResources = 10
    for (let i = 0; i < numResources; i++) {
      new CustomResource(this, `CR${i}`, {
        serviceToken: provider.serviceToken,
      });
    }
  }
}

Possible Solution

There might be some restrictions in ap-south-2.

Additional Information/Context

No response

CDK CLI Version

2.66.1 (build 539d036)

Framework Version

No response

Node.js Version

v16.17.0

OS

Linux

Language

TypeScript

Language Version

No response

Other information

No response

@pahud pahud added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Feb 28, 2023
@github-actions github-actions bot added the @aws-cdk/custom-resources Related to AWS CDK Custom Resources label Feb 28, 2023
@pahud
Contributor Author

pahud commented Feb 28, 2023

related to #23862 (comment)

@pahud
Contributor Author

pahud commented Feb 28, 2023

Same error if I change the runtime to Python 3.9

const onEventHandler = new lambda.Function(this, 'OnEventHandler', {
  runtime: lambda.Runtime.PYTHON_3_9,
  code: lambda.Code.fromInline(`def on_event(event, context): return {}`),
  handler: 'index.on_event',
})

@zelu-zuehlke

zelu-zuehlke commented Feb 28, 2023

We see the same error in the eu-central-2 region. We are using EKS (deployed with CDK) and have more than 10 custom resources.
It seems to me that this is not directly related to EKS, but to the Lambda layer that operates on the cluster.
Any update on this?

@pahud
Contributor Author

pahud commented Mar 2, 2023

I am reaching out to the relevant team internally.

@pahud pahud self-assigned this Mar 2, 2023
@robert-carbmee

It seems like this CDK commit is using the functionActiveV2 state from the JS SDK:
def2971

Then the change in CDK v2.60.0 switched off installLatestAwsSdk by default for custom resources, and from the PR it seems the default SDK version packaged by Lambda is 2.1055.0:
#23591

But the functionActiveV2 state seems to have been introduced in the SDK in version v2.1080.0 from this commit in the JS SDK repo:
aws/aws-sdk-js@488f6ad

So it seems the installLatestAwsSdk change is the culprit for the issue we are experiencing.

@robert-carbmee

I got around this for now by reverting to a CDK version older than 2.60.0.

@KSSLR

KSSLR commented Mar 21, 2023

Is there a way forward? I cannot go back to an older CDK version and I have to use isolated subnets.

@washimimizuku

Any updates on this?

@pahud
Contributor Author

pahud commented Mar 23, 2023

Thank you @robert-carbmee for the details. I have brought it up with the CDK core team and will post updates here whenever possible.

@MrArnoldPalmer
Contributor

This seems like the right root cause as far as I can tell. I guess the reason this shows up in ap-south-2 and not in other regions is that, for some reason, deploying a certain number of custom resources causes a backlog in Lambda function creation that leaves the functions pending, whereas in other regions we simply never call waitFor('functionActiveV2') because the invoke doesn't fail.

I guess the easiest fix would be changing to use functionActive instead of functionActiveV2.
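The invoke-then-wait pattern described here can be sketched as follows. This is not the CDK's actual code: LambdaLike, invokeWithWait and makeFakeClient are hypothetical names, and the fake client only simulates an SDK that recognizes a limited set of waiter states so that the failure mode can be reproduced without AWS access.

```typescript
// Hypothetical stand-in for the small part of a Lambda client this sketch needs.
interface LambdaLike {
  invoke(functionName: string): Promise<string>;
  waitFor(state: string, functionName: string): Promise<void>;
}

// Try to invoke; if that fails, wait for the function to become active, then retry once.
// The waiter state name is a parameter so both variants can be compared.
async function invokeWithWait(
  client: LambdaLike,
  functionName: string,
  waiterState: string = "functionActive",
): Promise<string> {
  try {
    return await client.invoke(functionName);
  } catch {
    await client.waitFor(waiterState, functionName);
    return client.invoke(functionName);
  }
}

// Fake client (illustration only): the function starts out inactive, and waitFor
// rejects any state name the simulated SDK version does not know about.
function makeFakeClient(knownStates: string[]): LambdaLike {
  let active = false;
  return {
    async invoke(functionName) {
      if (!active) {
        throw new Error(`${functionName} is not active`);
      }
      return "ok";
    },
    async waitFor(state) {
      if (knownStates.indexOf(state) === -1) {
        throw new Error(`StateNotFoundError: State ${state} not found.`);
      }
      active = true;
    },
  };
}
```

With a fake client that only knows functionActive, invokeWithWait(client, "handler") resolves after one wait, while passing "functionActiveV2" as the waiter state rejects with the StateNotFoundError message. In regions where the first invoke succeeds, the waiter is never reached, which matches the observation that the error is intermittent.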

@salper

salper commented Mar 28, 2023

FYI: same here in eu-west-1, using lambda backed custom resources.

@anthonygerrard

We're seeing this issue today in us-east-1. We're using CDK 2.63.1 with 3 custom resources built using the provided:al2 Docker image.

@SamuraiPrinciple

SamuraiPrinciple commented Apr 6, 2023

Seems to be related to concurrency - as a workaround, if you introduce (artificial) dependencies (using addDependency for example) it will reconcile successfully.

Here's an example (resources should be ordered in desired creation order):

[
  ...oidcClusterRoleManifests,
  clusterAutoscalerNamespace,
  clusterAutoscalerServiceAccount,
  clusterAutoscalerHelmChart,
  fluentBitNamespace,
  fluentBitServiceAccount,
  awsEbsCsiDriverNamespace,
  awsEbsCsiDriverServiceAccount,
  pixieHelmChart,
  externalDnsNamespace,
  externalDnsServiceAccount,
  externalDnsHelmChart,
  pmmNamespace,
  pmmServiceAccount,
  argocdHelmChart,
  deploymentRepoManifest,
  clusterBootstrapManifest,
  clusterAutoscalerAppManifest,
  argocdAppManifest,
  pixieAppManifest,
  externalDnsAppManifest,
]
  .reverse()
  .forEach((resource, index, resources) => {
    const nextResource = resources[index + 1];
    nextResource && resource.node.addDependency(nextResource);
  });

Obviously this is far from ideal as it makes the reconciliation slower (you could, however, increase the concurrency by removing just enough dependencies to stay below the threshold at which errors happen, but it's fiddly and fragile).

Would be nice to have a fix within the CDK though...
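For reference, the chaining workaround above can be factored into a small helper. This is only a sketch: Dependable and chainDependencies are hypothetical names, and the interface is assumed to mirror the node.addDependency shape used in the snippet above rather than importing the real CDK types.

```typescript
// Hypothetical minimal shape of a construct: only node.addDependency is needed here.
interface Dependable {
  node: { addDependency(...deps: Dependable[]): void };
}

// Makes each resource depend on the one before it in the array, so the
// resources are provisioned one at a time in the listed order.
function chainDependencies<T extends Dependable>(resources: T[]): void {
  resources.forEach((resource, index) => {
    const previous = resources[index - 1]; // undefined for the first element
    if (previous) {
      resource.node.addDependency(previous);
    }
  });
}
```

Calling chainDependencies([a, b, c]) leaves a without dependencies, makes b depend on a, and makes c depend on b. This produces the same ordering as the reverse-and-link snippet, without reversing the array.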

@kirnberger1980

We are also facing this issue in eu-central-1. We can reproduce it with just 2 custom resources created at the same time with the same CR Provider.

@SamuraiPrinciple

SamuraiPrinciple commented Apr 12, 2023

> We are also facing this issue in eu-central-1. We can reproduce the issue with just 2 Custom Resources at the same time with the identical CR Provider.

For science (not as a proper solution), does the issue go away if you add to your construct:

cr1.node.addDependency(cr2)

just so you force the CDK to not provision them concurrently.

@pahud
Contributor Author

pahud commented Apr 20, 2023

related to #24916

mergify bot pushed a commit that referenced this issue Apr 20, 2023
Replaces `functionActiveV2` with `functionActive`. 

`functionActiveV2` is not available in SDK versions < 2.1080.0, but the one that Lambda currently installs by default is 2.1055.0. The version that Lambda installs by default is the same that the CDK uses.

Closes #24358

----

*By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license*

@comcalvi comcalvi changed the title (custom-resources): StateNotFoundError: State functionActiveV2 not found in ap-south-2 (custom-resources): StateNotFoundError: State functionActiveV2 not found Apr 21, 2023
@comcalvi comcalvi added the management/tracking Issues that track a subject or multiple issues label Apr 21, 2023
@comcalvi comcalvi pinned this issue Apr 21, 2023
mergify bot pushed a commit that referenced this issue Apr 21, 2023
…Cloud regions (#25215)

Reopening this PR because #25170 was closed by accident.

As ECR Public is not available in China regions and GovCloud, `AmazonElasticContainerRegistryPublicReadOnly` IAM managed policy would not be available in those affected regions and should not be attached to the role. This PR implements a CfnCondition to determine if ECR public is available based on `Aws.Partition` of the deploying region and conditionally attach `AmazonElasticContainerRegistryPublicReadOnly` to the kubectl-provider handler role. 

This PR has been tested in the following regions:

- [x] *cn-north-1
- [x] *cn-northwest-1
- [x] us-east-1

* I can confirm the role is created correctly in cn regions but due to 
   - #24358 
   - #24696  
The cluster and nodegroup are still failing to create in CN.

Closes #24743 #24808 #25178
comcalvi added a commit that referenced this issue Apr 21, 2023
mergify bot pushed a commit to cdklabs/aws-cdk-notices that referenced this issue Apr 21, 2023
CLI notice for #24358. Inactive custom resource provider framework lambdas will cause errors due to Lambda's default SDK version (2.1055.0) not having `functionActiveV2` which only exists on 2.1080.0 and up.
@iliapolo iliapolo unpinned this issue May 22, 2023