
Regression issue on 1.5.1 #124

Closed
michelzanini opened this issue Dec 28, 2020 · 30 comments

@michelzanini

After upgrading to 1.5.1 I am getting the following error:

Error: health check timeout: Head "https://elasticsearch.mydomain.com": RequestCanceled: request context canceled
caused by: context deadline exceeded: no Elasticsearch node available

It could be related to aws_assume_role_arn, as I use it in my provider config:

provider "elasticsearch" {
  url                 = "https://elasticsearch.mydomain.com"
  aws_region          = "eu-west-1"
  aws_profile         = ""
  aws_assume_role_arn = "arn:aws:iam::111111111:role/Role"
  sign_aws_requests   = true
}

It only seems to happen when I use aws_assume_role_arn; it does not happen when I use aws_profile.
I am using Elasticsearch 7.9.

Reverting back to 1.5.0 makes the error disappear.

I see there are significant changes in PR #119; maybe it's related.

Thanks.

@phillbaker
Owner

phillbaker commented Dec 29, 2020

Hello, sorry to hear you're having issues. It sounds like this might be related to f924ab6 (#114)

Thanks for providing details and an example provider config.

It only seems to happen when I use aws_assume_role_arn; it does not happen when I use aws_profile.

I'm not quite following here. Are you saying that a different provider config does work in v1.5.1? (Can you share/clarify examples?)

@michelzanini
Author

I use Terragrunt to write a different Terraform file depending on whether I am in a CI environment or on a laptop.

When on a laptop, this is the config I use:

provider "elasticsearch" {
  url                 = "https://elasticsearch.mydomain.com"
  aws_region          = "eu-west-1"
  aws_profile         = "my_profile"
  aws_assume_role_arn = ""
  sign_aws_requests   = true
}

When on CI env, this is the one I use:

provider "elasticsearch" {
  url                 = "https://elasticsearch.mydomain.com"
  aws_region          = "eu-west-1"
  aws_profile         = ""
  aws_assume_role_arn = "arn:aws:iam::111111111:role/Role"
  sign_aws_requests   = true
}

On a laptop, it uses aws_profile. On the CI server, it uses aws_assume_role_arn.
On 1.5.0, both config files work.
On 1.5.1, it seems only the laptop config with aws_profile works.

@phillbaker
Owner

Thanks @michelzanini. Any chance the CI is running on EKS (#112)?

@michelzanini
Author

No, it's running on a standard EC2 instance.

@Delorien84

I can confirm that aws_assume_role_arn is not working on 1.5.1. It is running on an EC2 instance with an IAM role attached to that instance.

When I turn off the healthcheck, the execution blocks indefinitely.

My configuration is very similar:

provider "elasticsearch" {
  url                 = "https://custom.domain.com"
  aws_region          = "eu-west-1"
  aws_assume_role_arn = "arn:aws:iam::111111111:role/Role"
  sign_aws_requests   = true
}

@lifeofguenter

We use aws_profile, but it stopped working with 1.5.1:

provider "elasticsearch" {
  url               = "https://${module.logs_elasticsearch_remote.outputs.elasticsearch_endpoint}"
  aws_profile       = var.aws_profile
  sign_aws_requests = true
}

however, our profile looks like this:

[our-profile]
region            = us-east-1
credential_source = Ec2InstanceMetadata
role_arn          = arn:aws:iam::111111111111:role/ROLE_NAME

It works fine on 1.5.0.

@phillbaker
Owner

Sorry for the delay here. I've reverted part of f924ab6 and tagged a v1.5.2-beta (https://github.com/phillbaker/terraform-provider-elasticsearch/tree/v1.5.2-beta). That should get pushed to the Terraform registry shortly. Can you all please give that a try and let me know if this is resolved?
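
If it helps, here's a minimal sketch of pinning the pre-release (assuming Terraform 0.13+ required_providers syntax; pre-release versions are only selected when pinned to an exact version):

terraform {
  required_providers {
    elasticsearch = {
      source  = "phillbaker/elasticsearch"
      # A pre-release tag like this needs an exact version match.
      version = "1.5.2-beta"
    }
  }
}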

@phillbaker
Owner

Hello, following up on this. Has anyone been able to try v1.5.2-beta?

@lifeofguenter

lifeofguenter commented Jan 12, 2021

Unfortunately, on our side it did not fix the issue:

[2021-01-12T09:50:31.485Z] - Using phillbaker/elasticsearch v1.5.2-beta from the shared cache directory

[2021-01-12T09:51:07.021Z] Error: health check timeout: Head "https://sssss.us-east-1.es.amazonaws.com": RequestCanceled: request context canceled
[2021-01-12T09:51:07.021Z] caused by: context deadline exceeded: no Elasticsearch node available
[2021-01-12T09:51:07.021Z] 
[2021-01-12T09:51:07.021Z] 
[2021-01-12T09:51:07.021Z] 
[2021-01-12T09:51:07.021Z] Error: no active connection found: no Elasticsearch node available

Reverting to 1.5.0 still works.

@phillbaker
Owner

Thanks. I reverted the upgrade of the AWS client and released v1.5.2-beta1. Can folks on this thread give that a try and update here?

@phillbaker
Owner

Hi all, following up on this: has this been fixed in 1.5.2-beta1?

@phillbaker
Owner

Hi all, 1.5.2 has been released, so I'm going to close this as fixed. I don't have a way to reproduce, so I can't test directly. Please re-open if there are further issues.

@michelzanini
Author

Sorry, I did not have time to test this before. I tested with 1.5.4 and it seems it is still not working.

@michelzanini
Author

I can confirm the change that introduced this regression was #119.
I built binaries for every commit, and it broke starting with that one.

I am going to have a deeper look now to see if I can spot the issue, but it was definitely introduced there.
@phillbaker

@phillbaker
Owner

Thanks @michelzanini, that's very helpful. That strikes me as very odd, as #119 primarily changes the timing of calls, as opposed to which calls are made.

In order to narrow down the issue, could you try the following:

  • try setting sniff to false in the provider config
  • try setting elasticsearch_version to the correct Elasticsearch version to skip pinging the cluster when creating a client (see the sketch below)
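
For example, combining both suggestions (a minimal sketch based on the provider config above; the exact elasticsearch_version string, taken from the ES 7.9 mentioned earlier, is an assumption):

provider "elasticsearch" {
  url                   = "https://elasticsearch.mydomain.com"
  aws_region            = "eu-west-1"
  aws_assume_role_arn   = "arn:aws:iam::111111111:role/Role"
  sign_aws_requests     = true
  # Disable sniffing of cluster nodes.
  sniff                 = false
  # Skip pinging the cluster for its version when the client is created.
  elasticsearch_version = "7.9.0"
}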

@michelzanini
Author

Even with sniff set to false and elasticsearch_version set, I still get the errors:

Error: health check timeout: Head "https://elasticsearch.mydomain.com": RequestCanceled: request context canceled
caused by: context deadline exceeded: no Elasticsearch node available

  on main.tf line 8, in resource "elasticsearch_opendistro_role" "read_indexes_role":
   8: resource "elasticsearch_opendistro_role" "read_indexes_role" {



Error: health check timeout: Head "https://elasticsearch.mydomain.com": RequestCanceled: request context canceled
caused by: context deadline exceeded: no Elasticsearch node available

  on main.tf line 59, in resource "elasticsearch_opendistro_user" "developer_users":
  59: resource "elasticsearch_opendistro_user" "developer_users" {



Error: no active connection found: no Elasticsearch node available

  on main.tf line 72, in resource "elasticsearch_opendistro_ism_policy" "ism_policy":
  72: resource "elasticsearch_opendistro_ism_policy" "ism_policy" {

If I also set healthcheck to false, then there's no error, but the resources are never created and Terraform keeps running indefinitely. All resources keep printing Still creating... [100...s elapsed], etc.

This leads me to believe that there's some sort of race condition. I can't find the problem myself, as I do not have enough Go or Elasticsearch knowledge to track it down.

I will park this for now and stay locked to 1.5.0.
Would you consider reverting PR #119?

Alternatively, you could test this by creating an AWS instance and an Elasticsearch cluster, assigning an IAM role to the box, and running Terraform from there...

@michelzanini
Author

Not sure this will help, but these are the logs, which keep repeating like this forever:

(...)
elasticsearch_opendistro_role.read_indexes_role: Still creating... [40s elapsed]
2021/04/08 12:42:33 [TRACE] dag/walk: vertex "meta.count-boundary (EachMode fixup)" is waiting for "elasticsearch_opendistro_role.read_indexes_role"
2021/04/08 12:42:36 [TRACE] dag/walk: vertex "provider[\"registry.terraform.io/phillbaker/elasticsearch\"] (close)" is waiting for "elasticsearch_opendistro_role.read_indexes_role"
2021/04/08 12:42:37 [TRACE] dag/walk: vertex "root" is waiting for "meta.count-boundary (EachMode fixup)"
2021/04/08 12:42:38 [TRACE] dag/walk: vertex "meta.count-boundary (EachMode fixup)" is waiting for "elasticsearch_opendistro_role.read_indexes_role"
2021/04/08 12:42:41 [TRACE] dag/walk: vertex "provider[\"registry.terraform.io/phillbaker/elasticsearch\"] (close)" is waiting for "elasticsearch_opendistro_role.read_indexes_role"
2021/04/08 12:42:42 [TRACE] dag/walk: vertex "root" is waiting for "meta.count-boundary (EachMode fixup)"
elasticsearch_opendistro_role.read_indexes_role: Still creating... [50s elapsed]
2021/04/08 12:42:43 [TRACE] dag/walk: vertex "meta.count-boundary (EachMode fixup)" is waiting for "elasticsearch_opendistro_role.read_indexes_role"
2021/04/08 12:42:46 [TRACE] dag/walk: vertex "provider[\"registry.terraform.io/phillbaker/elasticsearch\"] (close)" is waiting for "elasticsearch_opendistro_role.read_indexes_role"
2021/04/08 12:42:47 [TRACE] dag/walk: vertex "root" is waiting for "meta.count-boundary (EachMode fixup)"
2021/04/08 12:42:48 [TRACE] dag/walk: vertex "meta.count-boundary (EachMode fixup)" is waiting for "elasticsearch_opendistro_role.read_indexes_role"
2021/04/08 12:42:51 [TRACE] dag/walk: vertex "provider[\"registry.terraform.io/phillbaker/elasticsearch\"] (close)" is waiting for "elasticsearch_opendistro_role.read_indexes_role"
2021/04/08 12:42:52 [TRACE] dag/walk: vertex "root" is waiting for "meta.count-boundary (EachMode fixup)"
elasticsearch_opendistro_role.read_indexes_role: Still creating... [60s elapsed]
(...)

@phillbaker
Owner

phillbaker commented Apr 9, 2021

Would you consider reverting PR #119?

Unfortunately, #119 touches too many pieces of code to revert now.

Alternatively, you could test this by creating an AWS instance and an Elasticsearch cluster, assigning an IAM role to the box, and running Terraform from there...

Unfortunately, I don't currently have access to an AWS environment where I can test this.

@phillbaker
Owner

Here's one guess I have: the deferred instantiation of the client means that the client is initialized once per resource, versus once at provider instantiation. This may be a problem if there are many resources (which also require reads to prepare a plan) and the AWS client needs to query resources like the EC2 metadata API (which is rate limited).

@michelzanini @lifeofguenter approximately how many elasticsearch_* resources are being managed in terraform?

@michelzanini
Author

michelzanini commented Apr 12, 2021

Hi @phillbaker, that makes a whole lot of sense. I have around 10 resources, more or less. Even though you don't have an AWS environment to test in, you can probably still test this behaviour with debugging?

@lifeofguenter

We also did not have a lot of resources; maybe around 10 as well.

We heavily monitored IMDS and other rate limits, as this was indeed a general issue, but I don't think it was the cause in this case.

I don't think this can be tested easily, though...

I would most probably look into how other providers use the aws-sdk. I do know, though, that there are some additional quirks, especially for signed requests and ES.

I am not actively using this provider anymore, otherwise I would invest some time. I think using earlier versions is just fine for most use cases.

@michelzanini
Author

I can confirm this has been fixed in 1.5.7.

@marksumm

This may be fixed for aws_assume_role_arn, but transparent role-based authentication via EC2 metadata is also broken after 1.5.0. Unfortunately, I need to upgrade because of other bugs that are only fixed in later versions of the provider.

@phillbaker
Owner

transparent role-based authentication via EC2 metadata is broken after 1.5.0

Hi @marksumm, can you clarify exactly the method that's being used here? What environment variables are set? What EC2 metadata is being used?

@marksumm

marksumm commented Sep 23, 2021

@phillbaker I meant a situation where no authentication attributes or environment variables are passed to the provider, healthchecks are disabled, and AWS request signing is enabled. Running locally uses the AWS credentials file as expected, but running on an EC2 instance now hangs indefinitely because state refreshes for resources created using the provider never return. The EC2 instance has an assumed role and so a session token is available via the metadata endpoint. Everything described was working in 1.5.0.

@phillbaker
Owner

phillbaker commented Sep 24, 2021

@marksumm please share the elasticsearch provider config that is working on 1.5.0 and not working in more recent versions. What URL does the ES cluster have? And is it self-hosted or in the AWS Elasticsearch/OpenSearch service?

@marksumm

@phillbaker The provider is configured like this...

provider "elasticsearch" {
  url               = "https://********.us-east-1.es.amazonaws.com"
  sign_aws_requests = true
  healthcheck       = false
}

The endpoint is apparently Elasticsearch 7.7, but it seems that AWS has already started to make changes to the API following the switch to OpenSearch. For example, index patterns should now be nested inside ISM policies and not created as separate resources. By the way, I tried setting AWS_SDK_LOAD_CONFIG=1, but it didn't help.

@marksumm

marksumm commented Sep 24, 2021

@phillbaker I've noticed that if I log in to an affected EC2 instance and target an individual resource created by this provider during terraform plan (and there are no dependencies on other resources), then the state refresh operation no longer hangs. If I attempt to target more than one resource created by this provider, or run an unmodified terraform plan, then I see the hanging behaviour as before. This is true even for a configuration with a very small number of resources (3), which seems to point to an internal deadlock, rather than an API limiting issue. Interestingly, setting -parallelism 1 doesn't seem to help.

phillbaker added a commit that referenced this issue Sep 25, 2021
@phillbaker
Owner

Hi @marksumm, this should be addressed in 64f21df; it'll be released in 2.0.0-beta.2 (coming shortly).

@marksumm

@phillbaker It works! Thank you so much.
