
AWS SAML connection error on AWS ElasticSearch cluster since upgrading #189

Closed

raids opened this issue Jun 24, 2021 · 9 comments

@raids

raids commented Jun 24, 2021

I'm seeing similar behaviour to #183, but it's not exactly the same.

I've recently upgraded from tf 0.13 to 1.0. Since upgrading and migrating the state, this provider doesn't seem to be able to authenticate with a cluster it was previously managing (elasticsearch_opendistro_roles and mappings, elasticsearch_index_templates, etc.).

terraform plan gives me one of these for each elasticsearch provider resource:

Error: health check timeout: Head "https://random-aws-url.aws-region.es.amazonaws.com": RequestCanceled: request context canceled
│ caused by: context deadline exceeded: no Elasticsearch node available

The cluster had SAML authentication turned on after it was created, but even since then, there have been hundreds of plans and applies without issue. I was previously on provider version 1.5.1 because of a regression blocking use of IAM roles (#149).

I remained on provider version 1.5.1 after upgrading terraform, and when I saw these errors, I tried the latest version 1.5.7 and also 1.5.0 (as referenced in other issues), but neither helps.

Let me know if I can do anything to help debug / troubleshoot with a cluster in AWS. Thanks!

Edit: Also, perhaps worth noting that the suggestions from #183 to set sign_aws_requests = false and insecure = true don't help (not that I'd want to set them long term 😁).

@phillbaker
Owner

Hello, can you please include the following:

  • elasticsearch version (and opendistro version if relevant)
  • redacted version of the terraform resource configuration
  • terraform logs by setting TF_LOG=info

@raids
Author

raids commented Jun 25, 2021

Provider version 1.5.1 with the following config (elasticsearch_version is the same version as the one used in my aws provider's aws_elasticsearch_domain resource). Provider version 1.5.0 gives basically the same logs.

provider elasticsearch {
  url                 = "https://${aws_elasticsearch_domain.domain.endpoint}"
  sign_aws_requests   = true
  aws_assume_role_arn = "arn:aws:iam::account-id:role/deployment-role-name"
  elasticsearch_version = "7.9"
}

Some resources I'm trying to manage with the elasticsearch provider:

  • ISM policy and index template
resource elasticsearch_opendistro_ism_policy delete_after_30d {
  policy_id = "delete_after_30d"
  body      = <<EOF
{
  "policy": {
    "description": "Delete indices older than 30 days",
    "default_state": "hot",
    "schema_version": 1,
    "states": [
      {
        "name": "hot",
        "actions": [],
        "transitions": [
          {
            "state_name": "delete",
            "conditions": {
              "min_index_age": "30d"
            }
          }
        ]
      },
      {
        "name": "delete",
        "actions": [
          {
            "delete": {}
          }
        ],
        "transitions": []
      }
    ]
  }
}
EOF
}

resource elasticsearch_index_template fluent_bit {
  name = "fluent-bit-template"
  body = <<EOF
{
  "index_patterns": [
    "logstash*"
  ],
  "settings": {
    "index": {
      "opendistro": {
        "index_state_management": {
          "policy_id": "${elasticsearch_opendistro_ism_policy.delete_after_30d.id}"
        }
      }
    }
  }
}
EOF
}
  • some permissions management
resource "elasticsearch_opendistro_roles_mapping" "fluent_bit_write" {
  role_name   = "logstash"
  description = "Allow fluent-bit pods to forward logs to ElasticSearch."
  backend_roles = [
    aws_iam_role.monitoring_fluent_bit_role.arn,
    "arn:aws:iam::account-id:role/*",
  ]
}

resource "elasticsearch_opendistro_roles_mapping" "all_access" {
  role_name   = "all_access"
  description = "Allow all actions."
  backend_roles = [
    "SSO Group Name",
    "arn:aws:iam::account-id:role/AWSReservedSSO_SSO_ROLE_NAME", # AWS SSO provisioned role
    "arn:aws:iam::account-id:role/deployment-role-name",
  ]
}

resource "elasticsearch_opendistro_role" "readall_and_monitor_global" {
  role_name   = "readall_and_monitor_global"
  description = "readall_and_monitor with access to the global tenant"

  cluster_permissions = ["cluster_monitor", "cluster_composite_ops_ro"]

  index_permissions {
    index_patterns  = ["*"]
    allowed_actions = ["read", "indices_monitor"]
  }

  tenant_permissions {
    tenant_patterns = ["global_tenant"]
    allowed_actions = ["kibana_all_read"]
  }
}

resource "elasticsearch_opendistro_roles_mapping" "readall_and_monitor_global" {
  role_name   = "readall_and_monitor_global"
  description = "Allow read only and monitor actions on the global tenant."
  backend_roles = [
    "SSO Group Name",
  ]
}

I didn't find anything particularly interesting in the logs with TF_LOG=info, but with TF_LOG=trace I see the following:

2021-06-25T08:07:32.648+0100 [TRACE] vertex "elasticsearch_opendistro_roles_mapping.all_access": visit complete
2021-06-25T08:07:32.648+0100 [TRACE] vertex "elasticsearch_opendistro_roles_mapping.all_access": dynamic subgraph encountered errors: health check timeout: Head "https://random-aws-subdomain.region.es.amazonaws.com": RequestCanceled: request context canceled
caused by: context deadline exceeded: no Elasticsearch node available
2021-06-25T08:07:32.648+0100 [TRACE] vertex "elasticsearch_opendistro_roles_mapping.all_access": visit complete
2021-06-25T08:07:32.648+0100 [TRACE] vertex "elasticsearch_opendistro_roles_mapping.all_access (expand)": dynamic subgraph encountered errors: health check timeout: Head "https://random-aws-subdomain.region.es.amazonaws.com": RequestCanceled: request context canceled
caused by: context deadline exceeded: no Elasticsearch node available
2021-06-25T08:07:32.648+0100 [TRACE] vertex "elasticsearch_opendistro_roles_mapping.all_access (expand)": visit complete
2021-06-25T08:07:32.648+0100 [TRACE] vertex "elasticsearch_opendistro_ism_policy.delete_after_30d": visit complete
2021-06-25T08:07:32.648+0100 [TRACE] vertex "elasticsearch_opendistro_ism_policy.delete_after_30d": dynamic subgraph encountered errors: health check timeout: Head "https://random-aws-subdomain.region.es.amazonaws.com": RequestCanceled: request context canceled
caused by: context deadline exceeded: no Elasticsearch node available
2021-06-25T08:07:32.648+0100 [TRACE] vertex "elasticsearch_opendistro_ism_policy.delete_after_30d": visit complete
2021-06-25T08:07:32.649+0100 [TRACE] vertex "elasticsearch_opendistro_ism_policy.delete_after_30d (expand)": dynamic subgraph encountered errors: health check timeout: Head "https://random-aws-subdomain.region.es.amazonaws.com": RequestCanceled: request context canceled
caused by: context deadline exceeded: no Elasticsearch node available
2021-06-25T08:07:32.649+0100 [TRACE] vertex "elasticsearch_opendistro_ism_policy.delete_after_30d (expand)": visit complete
2021-06-25T08:07:32.649+0100 [TRACE] dag/walk: upstream of "elasticsearch_index_template.fluent_bit (expand)" errored, so skipping
2021-06-25T08:07:32.649+0100 [TRACE] vertex "elasticsearch_opendistro_roles_mapping.readall_and_monitor_global": visit complete
2021-06-25T08:07:32.649+0100 [TRACE] vertex "elasticsearch_opendistro_roles_mapping.readall_and_monitor_global": dynamic subgraph encountered errors: health check timeout: Head "https://random-aws-subdomain.region.es.amazonaws.com": RequestCanceled: request context canceled
caused by: context deadline exceeded: no Elasticsearch node available
2021-06-25T08:07:32.649+0100 [TRACE] vertex "elasticsearch_opendistro_roles_mapping.readall_and_monitor_global": visit complete
2021-06-25T08:07:32.650+0100 [TRACE] vertex "elasticsearch_opendistro_roles_mapping.readall_and_monitor_global (expand)": dynamic subgraph encountered errors: health check timeout: Head "https://random-aws-subdomain.region.es.amazonaws.com": RequestCanceled: request context canceled
caused by: context deadline exceeded: no Elasticsearch node available
2021-06-25T08:07:32.650+0100 [TRACE] vertex "elasticsearch_opendistro_roles_mapping.readall_and_monitor_global (expand)": visit complete
2021-06-25T08:07:32.650+0100 [TRACE] vertex "module.ops_cluster.aws_iam_instance_profile.workers_launch_template": entering dynamic subgraph
2021-06-25T08:07:32.650+0100 [TRACE] vertex "elasticsearch_opendistro_role.readall_and_monitor_global": visit complete
2021-06-25T08:07:32.650+0100 [TRACE] vertex "elasticsearch_opendistro_role.readall_and_monitor_global": dynamic subgraph encountered errors: health check timeout: Head "https://random-aws-subdomain.region.es.amazonaws.com": RequestCanceled: request context canceled
caused by: context deadline exceeded: no Elasticsearch node available
2021-06-25T08:07:32.650+0100 [TRACE] vertex "elasticsearch_opendistro_role.readall_and_monitor_global": visit complete
2021-06-25T08:07:32.650+0100 [TRACE] vertex "elasticsearch_opendistro_role.readall_and_monitor_global (expand)": dynamic subgraph encountered errors: health check timeout: Head "https://random-aws-subdomain.region.es.amazonaws.com": RequestCanceled: request context canceled
caused by: context deadline exceeded: no Elasticsearch node available
2021-06-25T08:07:32.651+0100 [TRACE] vertex "elasticsearch_opendistro_role.readall_and_monitor_global (expand)": visit complete

@timcosta

I've seen this a lot in the past few days, BUT it's always been a configuration issue on my end. Try applying your authn/authz terraform using -target; I think your SAML auth config or master role ARN is likely out of sync. If that isn't terraformed, go to Modify Authentication in the AWS console and fill out everything again. This has happened to me when I accidentally dropped the assumed role's permissions from ES.

tl;dr: the way it says "you don't have permission" is by saying "no Elasticsearch node available".

@raids
Author

raids commented Jun 25, 2021

I spent some of today stepping through all of this to try and get to the bottom of the issue, and I think aws_assume_role_arn in my provider config isn't being respected.

# this doesn't work
provider elasticsearch {
  url                 = "https://${aws_elasticsearch_domain.domain.endpoint}"
  sign_aws_requests   = true
  aws_assume_role_arn = "arn:aws:iam::account-id:role/my-deployment-role"
}

As mentioned, the above provider config gives me the no Elasticsearch node available error. However, if I manually assume the role using aws sts assume-role --role-arn arn:aws:iam::account-id:role/my-deployment-role --role-session-name session-name and then plug the returned credentials into the provider, as below, then I can run a terraform plan with no issue.

# this works
provider elasticsearch {
  url                 = "https://${aws_elasticsearch_domain.domain.endpoint}"
  sign_aws_requests   = true
  aws_access_key = "XXXXXXXXXXXXX"
  aws_secret_key = "XXXXXXXXXXXXX"
  aws_token = "XXXXXXXXXXXXX"
}
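
For anyone reproducing this outside Terraform: the error message suggests the provider's health check is a SigV4-signed HEAD request against the domain endpoint, so a minimal standalone sketch along the following lines can help separate credential problems from provider problems. The endpoint, region, and credentials are placeholders (plug in the values returned by aws sts assume-role).

package main

import (
	"fmt"
	"log"
	"net/http"
	"time"

	"github.com/aws/aws-sdk-go/aws/credentials"
	v4 "github.com/aws/aws-sdk-go/aws/signer/v4"
)

func main() {
	// Placeholder credentials: use the values returned by `aws sts assume-role`.
	creds := credentials.NewStaticCredentials("ACCESS_KEY", "SECRET_KEY", "SESSION_TOKEN")

	// A SigV4-signed HEAD against the domain endpoint, roughly what the
	// failing health check appears to do based on the error message.
	req, err := http.NewRequest(http.MethodHead, "https://random-aws-url.aws-region.es.amazonaws.com/", nil)
	if err != nil {
		log.Fatal(err)
	}
	if _, err := v4.NewSigner(creds).Sign(req, nil, "es", "aws-region", time.Now()); err != nil {
		log.Fatal(err)
	}

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status) // a 2xx status means the signed identity is accepted by the domain
}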

Perhaps this is similar to #124? I'm using provider version 1.5.7, so perhaps there's another case not accounted for.

@raids
Author

raids commented Jun 25, 2021

Also, it might be worthwhile to note that I'm using an SSO profile by doing AWS_PROFILE=sso-profile terraform plan:

[profile sso-profile]
sso_start_url = https://mycompany.awsapps.com/start/
sso_region = eu-west-1
sso_account_id = account-id
sso_role_name = role-name
region = eu-west-1
output = json

I used the same profile to assume the role with, so it definitely has the right permissions for that.

@phillbaker
Owner

Hi @raids, sorry to hear that this broke unexpectedly! Thanks for providing details and debugging.

I've recently upgraded from tf 0.13 to 1.0. Since upgrading and migrating the state, this provider doesn't seem able to authenticate with a cluster which it previously was managing

I'm not sure why it would matter, but would you be able to try downgrading terraform? If the terraform version change started the issue, perhaps reverting would fix it?

if I manually assume the role using aws sts assume-role --role-arn arn:aws:iam::account-id:role/my-deployment-role --role-session-name session-name and then plug the returned credentials into the provider, as below, then I can run a terraform plan with no issue.

This definitely points to an issue - have you tried specifying the profile via the aws_profile option in the provider config?

Can you confirm what file the profile information is stored in on disk?

@raids
Author

raids commented Jun 30, 2021

Hi @phillbaker - apologies, I haven't had much time to test this further. Some answers to your questions and then some good news(?) below.

I'm not sure why it would matter, but would you be able to try downgrading terraform? If the terraform version change started the issue, perhaps reverting would fix it?

I went ahead with the Terraform 1.0 upgrade as a priority, so running again on a downgraded version of Terraform isn't an easy option for me right now.

This definitely points to an issue - have you tried specifying the profile via the aws_profile option in the provider config?

We can't use aws_profile in our provider config as it's not aligned with how our CI deploys, so including it in the templates is not an option.

Can you confirm what file the profile information is stored in on disk?

The SSO profile information is stored in ~/.aws/config.

The good news is that I think I know where the issue stems from, and it's not from upgrading the provider version:

  • I didn't emphasize that when I migrated the state during the upgrade, it wasn't just from one version of Terraform to another; it was also to a different S3 bucket in a different account, which I must use an SSO profile/role to access.

  • After having a quick look at AWS Go SDK's support for SSO profiles, I found an AWS blog post with the following snippet:

      sess, err := session.NewSessionWithOptions(session.Options{
          SharedConfigState: session.SharedConfigEnable, // Must be set to enable
          Profile:           "dev-profile",
      })
    

    I tested that snippet locally (i.e. with AWS_PROFILE=profile-name go run main.go) and it worked as expected. Without SharedConfigState set, it would not work.

    I poked around a bit, and after seeing the comment description of the session.SharedConfigEnable const, I tried setting the env var AWS_SDK_LOAD_CONFIG=1 instead, which worked (a fuller sketch of this behaviour follows below). I imagine this would also work with my terraform plans, and I'll try to test that this week when I have some time. From what I can see, this provider isn't setting that option, so I'm not sure how or if you want to handle this. For now, I'm content with setting that environment variable, as it would only be needed for local dev interaction with Terraform, not our CI pipeline, which is a good old-fashioned IAM user assuming roles.
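
A minimal, self-contained sketch of the behaviour described above, assuming the AWS SDK for Go v1 (the profile name and role ARN are placeholders): with SharedConfigEnable the SDK reads the SSO profile from ~/.aws/config, after which the deployment role can be assumed on top of those credentials, mirroring what aws_assume_role_arn is expected to do.

package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/credentials/stscreds"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/sts"
)

func main() {
	// Without SharedConfigEnable (or AWS_SDK_LOAD_CONFIG=1), the SDK only reads
	// ~/.aws/credentials and never sees SSO profiles defined in ~/.aws/config.
	sess, err := session.NewSessionWithOptions(session.Options{
		SharedConfigState: session.SharedConfigEnable,
		Profile:           "sso-profile", // placeholder profile name
	})
	if err != nil {
		log.Fatal(err)
	}

	// Assume the deployment role on top of the SSO credentials, mirroring
	// what aws_assume_role_arn is expected to do.
	creds := stscreds.NewCredentials(sess, "arn:aws:iam::account-id:role/my-deployment-role")

	// Confirm which identity signed requests would actually use.
	out, err := sts.New(sess, &aws.Config{Credentials: creds}).GetCallerIdentity(&sts.GetCallerIdentityInput{})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(aws.StringValue(out.Arn))
}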

phillbaker changed the title from "Can't authenticate against AWS ElasticSearch cluster since upgrading" to "AWS SAML connection error on AWS ElasticSearch cluster since upgrading" on Jul 3, 2021
@phillbaker
Owner

Thanks @raids, the suggestion to set SharedConfigState: session.SharedConfigEnable looks very similar to what the terraform-aws-provider does (in its upstream auth library): hashicorp/aws-sdk-go-base#38. I'll add a commit to do that as it looks pretty straightforward, without downsides.
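
For context, a hypothetical sketch of what such a change might look like where the provider builds its AWS session (this is not the actual commit, and newSession is an invented name standing in for the provider's session setup):

package main

import (
	"github.com/aws/aws-sdk-go/aws/session"
)

// newSession is a hypothetical helper showing the idea: always opt in to
// shared config so profiles defined in ~/.aws/config (including SSO
// profiles) are honoured without requiring AWS_SDK_LOAD_CONFIG=1.
func newSession(profile string) (*session.Session, error) {
	opts := session.Options{
		SharedConfigState: session.SharedConfigEnable, // read ~/.aws/config as well as ~/.aws/credentials
	}
	if profile != "" { // stands in for the provider's aws_profile setting
		opts.Profile = profile
	}
	return session.NewSessionWithOptions(opts)
}

func main() {
	if _, err := newSession(""); err != nil {
		panic(err)
	}
}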

@phillbaker
Owner

Done in 72475b5
