(certificatemanager): DnsValidatedCertificate timeout while waiting for certificate approval #2914

Closed
Labels
@aws-cdk/aws-certificatemanager Related to Amazon Certificate Manager bug This issue is a bug. effort/medium Medium work item – several days of effort p1

Comments

@KnisterPeter
Contributor

Describe the bug
Creating certificates via Certificate Manager with Route 53 DNS validation fails with a timeout.
Error message:

Failed to create resource. Resource is not in the state certificateValidated

Expected behavior
The Lambda waiting for the approval should probably wait longer than the currently hardcoded 5 minutes.

Version:

  • OS: linux
  • Programming Language: typescript
  • CDK Version: 0.33.x
@KnisterPeter KnisterPeter added the bug This issue is a bug. label Jun 18, 2019
@NGL321 NGL321 added needs-triage This issue or PR still needs to be triaged. @aws-cdk/aws-route53 Related to Amazon Route 53 and removed needs-triage This issue or PR still needs to be triaged. labels Jun 18, 2019
@NGL321
Contributor

NGL321 commented Jun 19, 2019

Could you give more detail on the steps that led to the error message?

@NGL321 NGL321 added the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Jun 19, 2019
@KnisterPeter
Contributor Author

Sure, I've used this code fragment:

    new certificatemanager.DnsValidatedCertificate(this, 'id', {
      domainName: 'some-name',
      hostedZone: zone
    })

During cdk deploy, the above error was thrown after some time. When I then looked in the Certificate Manager console, I saw that the requested certificate was indeed still pending validation.

Therefore I think it's a timing issue: the Lambda code for the DNS validation contains a wait of 5 minutes. If I'm right, this may be a bit too short.

https://github.com/awslabs/aws-cdk/blob/master/packages/%40aws-cdk/aws-certificatemanager/lambda-packages/dns_validated_certificate_handler/lib/index.js#L142

@RomainMuller
Contributor

The runtime for the whole execution may not exceed 15 minutes. The function currently waits up to 5 minutes for the DNS record to commit, then waits up to 5 minutes for the ACM validation to happen. That does not leave much margin.

RomainMuller added a commit that referenced this issue Jun 20, 2019
Allow the Lambda function to wait up to 9 minutes and 20 seconds before
bailing out waiting for the domain to be validated. It used to be
waiting no more than 5 minutes and would occasionally time out on users.

Fixes #2914 (hopefully)
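
As a rough illustration of the time budget described above, the sketch below does the arithmetic for a single Lambda invocation. It assumes the DNS-commit wait stays at 5 minutes after the change, which the commit message does not state; the constant and function names are made up for this example and are not taken from the handler's source.

```typescript
// All values in seconds. Names are hypothetical, chosen for illustration only.
const LAMBDA_MAX_RUNTIME_S = 15 * 60;      // whole execution may not exceed 15 minutes
const DNS_COMMIT_WAIT_S = 5 * 60;          // assumed unchanged: wait for the DNS record to commit
const OLD_VALIDATION_WAIT_S = 5 * 60;      // validation wait before the change
const NEW_VALIDATION_WAIT_S = 9 * 60 + 20; // validation wait after the change (9 min 20 s)

/** Seconds left for everything else if both waits run to their limits. */
function marginSeconds(validationWaitS: number): number {
  return LAMBDA_MAX_RUNTIME_S - DNS_COMMIT_WAIT_S - validationWaitS;
}

const oldMargin = marginSeconds(OLD_VALIDATION_WAIT_S);
const newMargin = marginSeconds(NEW_VALIDATION_WAIT_S);

console.log(oldMargin); // 300
console.log(newMargin); // 40
```

Under that assumption the longer wait uses up almost the entire 15 minutes, so there is little room to extend it further within one invocation.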
@KnisterPeter
Contributor Author

@RomainMuller Thanks, that will probably help in a lot of situations. Unfortunately, Certificate Manager says that approving pending certificate requests can take 30 minutes or longer, so there is still a lot of room to fail. But I think this will help a lot.

@abend-arg

For those running into this problem, use the Certificate construct instead. It allows you to achieve the very same thing without the time limit. Something like this:

        const certificate = new acm.Certificate(this, `${PREFIX}LandingPageAcmCertificate`, {
            domainName: SITE_DOMAIN,
            subjectAlternativeNames: [`www.${SITE_DOMAIN}`],
            validation: acm.CertificateValidation.fromDns(rootHostedZone)
        });

@peterwoodworth peterwoodworth changed the title ACM timeout while waiting for certificate approval (certificatemanager): ACM timeout while waiting for certificate approval Nov 2, 2021
@peterwoodworth peterwoodworth removed @aws-cdk/aws-route53 Related to Amazon Route 53 needs-reproduction This issue needs reproduction. labels Nov 2, 2021
@github-actions github-actions bot added the @aws-cdk/aws-certificatemanager Related to Amazon Certificate Manager label Nov 2, 2021
@njlynch njlynch changed the title (certificatemanager): ACM timeout while waiting for certificate approval (certificatemanager): DnsValidatedCertificate timeout while waiting for certificate approval Nov 3, 2021
@njlynch
Contributor

njlynch commented Nov 3, 2021

For those experiencing this issue:

Unless you absolutely need cross-region certificate issuance (e.g., requesting a us-east-1 certificate from another region for CloudFront), then converting to use the Certificate construct (as @AbendGithub notes above) is your best bet. The Certificate construct does not have the same time-out constraints as DnsValidatedCertificate and uses CloudFormation's internal workflow system for provisioning and validating.

If you must use DnsValidatedCertificate, give yourself the best possible chance of success by creating and deploying your Route53 HostedZone first, validating the domain with tools like dig, nslookup, etc., and only then adding the certificate to the deployment. See https://docs.aws.amazon.com/acm/latest/userguide/troubleshooting-DNS-validation.html for a list of common DNS validation troubleshooting tips. In particular, if something like % dig yourhostname.example.com does not return the 4 name servers associated with your hosted zone prior to starting the deployment, your certificate will never validate.
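
The name server check described above can be sketched as a small comparison helper. This is only an illustration: the function names and the example host names are hypothetical, and the two lists are assumed to have been collected beforehand (for example from the hosted zone's NS record and from the output of dig).

```typescript
/** Normalize a name server host: trim, lowercase, drop the trailing root dot. */
function normalizeNs(host: string): string {
  return host.trim().toLowerCase().replace(/\.$/, "");
}

/** Sorted, de-duplicated, normalized copy of a name server list. */
function canonical(list: string[]): string[] {
  const hosts = list.map(normalizeNs).sort();
  return hosts.filter((h, i) => i === 0 || h !== hosts[i - 1]);
}

/** True when both lists name exactly the same servers, in any order. */
function nameServersMatch(zoneNs: string[], delegatedNs: string[]): boolean {
  const expected = canonical(zoneNs);
  const actual = canonical(delegatedNs);
  return expected.length === actual.length && expected.every((h, i) => h === actual[i]);
}

// Hypothetical values: what the hosted zone lists vs. what a DNS lookup returned.
const zoneNs = ["ns-1.example.net.", "ns-2.example.org."];
const delegatedNs = ["NS-2.example.org", "ns-1.example.net"];

const delegationLooksRight = nameServersMatch(zoneNs, delegatedNs);
console.log(delegationLooksRight); // true
```

If the comparison fails, the domain is most likely still delegated to different name servers than the hosted zone's, which is the situation where the certificate will not validate.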

@njlynch njlynch removed their assignment Nov 8, 2021
@BillyBunn

BillyBunn commented Nov 18, 2021

@njlynch Unfortunately I'm experiencing the same timeout issue, even with the Certificate construct. I've tried using both.

DnsValidatedCertificate timed out after a few minutes with

CREATE_FAILED | AWS::CloudFormation::CustomResource 
Received response status [FAILED] from custom resource. Message returned: Resource is not in the state certificateValidated
... stacktrace

Certificate timed out after a few hours with

CREATE_FAILED | AWS::CertificateManager::Certificate 
Certificate is in PENDING_VALIDATION status
... stacktrace

Also, with both approaches the failed stack cannot be deleted, because of DNS record sets created in the same deployment that pointed at a CloudFront alias (this should probably be a separate issue).

DELETE_FAILED | AWS::Route53::HostedZone
The specified hosted zone contains non-required resource record sets  and so cannot be deleted.

I ran into this trying to deploy a static site (S3 bucket, CloudFront distribution, Route53 hosted zone, ACM certificate) with a domain already registered with Route53. I have also noticed what @acdoussan mentioned: the name servers for the registered domain do not match the hosted zone NS records made by PublicHostedZone.

Anything obvious that is causing this? My code:

    const websiteBucket = new s3.Bucket(this, "WebsiteBucket", {
      autoDeleteObjects: true,
      publicReadAccess: false,
      removalPolicy: cdk.RemovalPolicy.DESTROY,
    });

    const websiteHostedZone = new route53.PublicHostedZone(this, "WebsiteHostedZone", {
      zoneName: 'domain-name.com',
    });

    // Have also tried `DnsValidatedCertificate`
    const websiteCertificate = new certificateManager.Certificate(this, "WebsiteCertificate", {
      domainName: 'domain-name.com',
      subjectAlternativeNames: ['www.domain-name.com'],
      validation: certificateManager.CertificateValidation.fromDns(websiteHostedZone),
    });

    const websiteBucketDistribution = new cloudfront.Distribution(this, "WebsiteBucketDistribution", {
      certificate: websiteCertificate,
      defaultBehavior: {
        origin: new origins.S3Origin(websiteBucket),
        viewerProtocolPolicy: cloudfront.ViewerProtocolPolicy.REDIRECT_TO_HTTPS,
      },
      defaultRootObject: "index.html",
      domainNames: ['domain-name.com'],
    });

    new route53.ARecord(this, "WebsiteARecord", {
      target: route53.RecordTarget.fromAlias(new targets.CloudFrontTarget(websiteBucketDistribution)),
      recordName: 'domain-name.com',
      zone: websiteHostedZone,
    });

    new route53.AaaaRecord(this, "WebsiteAAAARecord", {
      target: route53.RecordTarget.fromAlias(new targets.CloudFrontTarget(websiteBucketDistribution)),
      recordName: 'domain-name.com',
      zone: websiteHostedZone,
    });

Edit:
I can reproduce it with just this:

    const websiteHostedZone = new route53.PublicHostedZone(this, "WebsiteHostedZone", {
      zoneName: 'domain-name.com',
    });

    // Have also tried `DnsValidatedCertificate`
    const websiteCertificate = new certificateManager.Certificate(this, "WebsiteCertificate", {
      domainName: 'domain-name.com',
      subjectAlternativeNames: ['www.domain-name.com'],
      validation: certificateManager.CertificateValidation.fromDns(websiteHostedZone),
    });

@ekeyser

ekeyser commented Dec 22, 2021

I notice that when the zones are for domains that have not been purchased (lack NS registrar records) this happens. I suppose that makes some sort of sense since we're talking about domain ownership. I was doing testing and didn't want to buy a domain just for testing some cdk/cloudformation code. Maybe this note will help someone. Just sayin'.

@dpistole

dpistole commented Jan 30, 2022

@BillyBunn, might be a long shot, but I switched to Certificate and my deploy started hanging as well. I never let it time out, but I noticed that my Gmail spam folder had a bunch of emails from AWS re: Certificate Approval with a link that I had to click to approve the certificate. I marked them as not spam and tried again; clicking the approve link seemed to do the trick.

I switched back to the DNS validated cert afterward, and that one seems to work if I wait for the hostedZone to get created, then use its name servers to update the name servers section under registered domains via the UI. The deploy hangs while I do that but then seems to finish up.

@lehotskysamuel

lehotskysamuel commented Jul 22, 2022

I'm sorry, but I believe this can only be properly fixed by an internal Amazon team.

The problem is that DnsValidatedCertificate works by creating a custom resource backed by a Lambda that adds those records and then waits for validation. But since this is a Lambda, there is a max run time of 15 minutes. Yet based on the comments above, validating certificates may take hours in us-east-1. I've currently been waiting on validation for 49 minutes and it's still not validated.

As to why we have to use the DnsValidatedCertificate: We are a team in Europe, with our main region being Ireland: eu-west-1. There are many use cases that require certificates placed in N. Virginia: us-east-1. That rules out the regular acm.Certificate class because that class will only deploy to the main region.

We also don't want a separate stack that deploys into us-east-1 because then you cannot export certificate ARN and import it into another stack. Fn::importValue only works within the same region.

Workarounds: The only workaround right now is to deploy it in a separate stack into us-east-1, then have a second stack that exports certificate values which are hard-coded as strings (manual step) and then have a third stack which actually uses those values.

One other workaround is to retry stack deployment early in the morning when it seems to get validated in time - but that is highly unreliable.

Solutions: Well ideally you could internally push for making certificate validations faster in that region and guarantee validations under 15 minutes. Or implement an API to do cross-region certificate creations, so CloudFormation would support this scenario natively (without the lambda). Or don't force us to deploy certificates to a specific region (us-east-1), then we could all happily use the acm.Certificate class.

I've never really used CustomResource, so I don't know much about it. But is there a way to run something other than a Lambda that might run for longer?

If you can't do any of that, you could at least make the stack deployments idempotent. The problem is that the custom resource Lambda fails and triggers a rollback, which orphans the certificate, and a new re-deployment doesn't use the original cert that might already be validated. There would be no problem if I could deploy a stack, wait for it to fail due to the Lambda timeout, wait until the certificate is validated, and re-deploy, with the deployment picking up the original certificate and completing successfully.

Does it really need to fail and trigger rollback? How come the main acm.Certificate within one region works?

At the very least this issue should be documented on the cdk page for DnsValidatedCertificate construct.

@skrud-dt

skrud-dt commented Nov 23, 2022

I think this should be fixable by using CustomResourceProvider instead of just a straight CustomResource. Then the waiting could happen using isComplete instead of inline right after creating the certificate. Can CDK constructs use CustomResourceProvider?

(I may be able to take a stab at this)
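
A minimal sketch of the polling idea suggested above, assuming it refers to an isComplete-style handler that is invoked repeatedly and reports whether the resource is ready, so that no single Lambda run has to span the whole validation. Only the decision logic is shown; the function name is hypothetical, the ACM lookup that would supply the status is left out, and only a subset of certificate statuses is modelled.

```typescript
// Subset of ACM certificate statuses relevant while waiting for issuance.
type CertificateStatus =
  | "PENDING_VALIDATION"
  | "ISSUED"
  | "VALIDATION_TIMED_OUT"
  | "FAILED";

// Response shape for an isComplete-style poll: done or not done yet.
interface IsCompleteResponse {
  IsComplete: boolean;
}

/**
 * Decide what one poll should report for a given certificate status.
 * Returning IsComplete: false means "ask again later"; throwing means
 * the resource failed and polling should stop.
 */
function isCertificateComplete(status: CertificateStatus): IsCompleteResponse {
  switch (status) {
    case "ISSUED":
      return { IsComplete: true };
    case "PENDING_VALIDATION":
      return { IsComplete: false };
    default:
      throw new Error(`Certificate validation ended in state ${status}`);
  }
}

console.log(isCertificateComplete("PENDING_VALIDATION").IsComplete); // false
console.log(isCertificateComplete("ISSUED").IsComplete); // true
```

With this split, each poll finishes quickly, and the overall waiting time is governed by how long the caller keeps polling rather than by the 15 minute limit on one invocation.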

mergify bot pushed a commit that referenced this issue Jan 25, 2023
Now that the official CloudFormation resource `AWS::CertificateManager::Certificate` (CDK's `Certificate` construct) supports DNS validation we do not want to recommend using the `DnsValidatedCertificate` construct.

The `DnsValidatedCertificate` construct uses CloudFormation custom resources to perform the certificate creation and this creates a lot of maintenance burden on our team (see the list of linked issues). Currently the primary use case for using `DnsValidatedCertificate` over `Certificate` is for cross region use cases. For this use case I have updated the README to have our suggested solution.

The example in the README is tested in this [integration test](https://github.com/aws/aws-cdk/blob/main/packages/@aws-cdk/aws-cloudfront/test/integ.cloudfront-cross-region-cert.ts)

fixes #8934, #2914, #20698, #17349, #15217, #14519
