
AWS SAML connection error on AWS ElasticSearch cluster since upgrading #189

Closed

raids opened this issue Jun 24, 2021 · 9 comments

@raids

raids commented Jun 24, 2021

I'm seeing similar behaviour to #183, but it's not exactly the same.

I've recently upgraded from tf 0.13 to 1.0. Since upgrading and migrating the state, this provider doesn't seem to be able to authenticate with a cluster it was previously managing (elasticsearch_opendistro_roles and mappings, elasticsearch_index_templates, etc.).

terraform plan gives me one of these for each elasticsearch provider resource:

Error: health check timeout: Head "https://random-aws-url.aws-region.es.amazonaws.com": RequestCanceled: request context canceled
│ caused by: context deadline exceeded: no Elasticsearch node available

The cluster had SAML authentication turned on after it was created, but even since then, there have been hundreds of plans and applies without issue. I was previously on provider version 1.5.1 because of a regression blocking use of IAM roles (#149).

I remained on provider version 1.5.1 after upgrading terraform, and when I saw these errors, I tried the latest version 1.5.7 and also 1.5.0 (as referenced in other issues), but neither helps.

Let me know if I can do anything to help debug / troubleshoot with a cluster in AWS. Thanks!

Edit: Also, perhaps worth noting that the suggestions from #183 to set sign_aws_requests = false and insecure = true don't help (not that I'd want to set them long term 😁).

@phillbaker
Owner

Hello, can you please include the following:

  • elasticsearch version (and opendistro version if relevant)
  • redacted version of the terraform resource configuration
  • terraform logs by setting TF_LOG=info

@raids
Author

raids commented Jun 25, 2021

Provider version 1.5.1 with the following config (elasticsearch_version is the same version as the one used in my aws provider's aws_elasticsearch_domain resource). Provider version 1.5.0 gives basically the same logs.

provider elasticsearch {
  url                 = "https://${aws_elasticsearch_domain.domain.endpoint}"
  sign_aws_requests   = true
  aws_assume_role_arn = "arn:aws:iam::account-id:role/deployment-role-name"
  elasticsearch_version = "7.9"
}

Some resources I'm trying to manage with the elasticsearch provider:

  • ISM policy and index template
resource elasticsearch_opendistro_ism_policy delete_after_30d {
  policy_id = "delete_after_30d"
  body      = <<EOF
{
  "policy": {
    "description": "Delete indices older than 30 days",
    "default_state": "hot",
    "schema_version": 1,
    "states": [
      {
        "name": "hot",
        "actions": [],
        "transitions": [
          {
            "state_name": "delete",
            "conditions": {
              "min_index_age": "30d"
            }
          }
        ]
      },
      {
        "name": "delete",
        "actions": [
          {
            "delete": {}
          }
        ],
        "transitions": []
      }
    ]
  }
}
EOF
}

resource elasticsearch_index_template fluent_bit {
  name = "fluent-bit-template"
  body = <<EOF
{
  "index_patterns": [
    "logstash*"
  ],
  "settings": {
    "index": {
      "opendistro": {
        "index_state_management": {
          "policy_id": "${elasticsearch_opendistro_ism_policy.delete_after_30d.id}"
        }
      }
    }
  }
}
EOF
}
  • some permissions management
resource "elasticsearch_opendistro_roles_mapping" "fluent_bit_write" {
  role_name   = "logstash"
  description = "Allow fluent-bit pods to forward logs to ElasticSearch."
  backend_roles = [
    aws_iam_role.monitoring_fluent_bit_role.arn,
    "arn:aws:iam::account-id:role/*",
  ]
}

resource "elasticsearch_opendistro_roles_mapping" "all_access" {
  role_name   = "all_access"
  description = "Allow all actions."
  backend_roles = [
    "SSO Group Name",
    "arn:aws:iam::account-id:role/AWSReservedSSO_SSO_ROLE_NAME", # AWS SSO provisioned role
    "arn:aws:iam::account-id:role/deployment-role-name",
  ]
}

resource "elasticsearch_opendistro_role" "readall_and_monitor_global" {
  role_name   = "readall_and_monitor_global"
  description = "readall_and_monitor with access to the global tenant"

  cluster_permissions = ["cluster_monitor", "cluster_composite_ops_ro"]

  index_permissions {
    index_patterns  = ["*"]
    allowed_actions = ["read", "indices_monitor"]
  }

  tenant_permissions {
    tenant_patterns = ["global_tenant"]
    allowed_actions = ["kibana_all_read"]
  }
}

resource "elasticsearch_opendistro_roles_mapping" "readall_and_monitor_global" {
  role_name   = "readall_and_monitor_global"
  description = "Allow read only and monitor actions on the global tenant."
  backend_roles = [
    "SSO Group Name",
  ]
}

I didn't find anything particularly interesting in the logs with TF_LOG=info, but with TF_LOG=trace I see the following:

2021-06-25T08:07:32.648+0100 [TRACE] vertex "elasticsearch_opendistro_roles_mapping.all_access": visit complete
2021-06-25T08:07:32.648+0100 [TRACE] vertex "elasticsearch_opendistro_roles_mapping.all_access": dynamic subgraph encountered errors: health check timeout: Head "https://random-aws-subdomain.region.es.amazonaws.com": RequestCanceled: request context canceled
caused by: context deadline exceeded: no Elasticsearch node available
2021-06-25T08:07:32.648+0100 [TRACE] vertex "elasticsearch_opendistro_roles_mapping.all_access": visit complete
2021-06-25T08:07:32.648+0100 [TRACE] vertex "elasticsearch_opendistro_roles_mapping.all_access (expand)": dynamic subgraph encountered errors: health check timeout: Head "https://random-aws-subdomain.region.es.amazonaws.com": RequestCanceled: request context canceled
caused by: context deadline exceeded: no Elasticsearch node available
2021-06-25T08:07:32.648+0100 [TRACE] vertex "elasticsearch_opendistro_roles_mapping.all_access (expand)": visit complete
2021-06-25T08:07:32.648+0100 [TRACE] vertex "elasticsearch_opendistro_ism_policy.delete_after_30d": visit complete
2021-06-25T08:07:32.648+0100 [TRACE] vertex "elasticsearch_opendistro_ism_policy.delete_after_30d": dynamic subgraph encountered errors: health check timeout: Head "https://random-aws-subdomain.region.es.amazonaws.com": RequestCanceled: request context canceled
caused by: context deadline exceeded: no Elasticsearch node available
2021-06-25T08:07:32.648+0100 [TRACE] vertex "elasticsearch_opendistro_ism_policy.delete_after_30d": visit complete
2021-06-25T08:07:32.649+0100 [TRACE] vertex "elasticsearch_opendistro_ism_policy.delete_after_30d (expand)": dynamic subgraph encountered errors: health check timeout: Head "https://random-aws-subdomain.region.es.amazonaws.com": RequestCanceled: request context canceled
caused by: context deadline exceeded: no Elasticsearch node available
2021-06-25T08:07:32.649+0100 [TRACE] vertex "elasticsearch_opendistro_ism_policy.delete_after_30d (expand)": visit complete
2021-06-25T08:07:32.649+0100 [TRACE] dag/walk: upstream of "elasticsearch_index_template.fluent_bit (expand)" errored, so skipping
2021-06-25T08:07:32.649+0100 [TRACE] vertex "elasticsearch_opendistro_roles_mapping.readall_and_monitor_global": visit complete
2021-06-25T08:07:32.649+0100 [TRACE] vertex "elasticsearch_opendistro_roles_mapping.readall_and_monitor_global": dynamic subgraph encountered errors: health check timeout: Head "https://random-aws-subdomain.region.es.amazonaws.com": RequestCanceled: request context canceled
caused by: context deadline exceeded: no Elasticsearch node available
2021-06-25T08:07:32.649+0100 [TRACE] vertex "elasticsearch_opendistro_roles_mapping.readall_and_monitor_global": visit complete
2021-06-25T08:07:32.650+0100 [TRACE] vertex "elasticsearch_opendistro_roles_mapping.readall_and_monitor_global (expand)": dynamic subgraph encountered errors: health check timeout: Head "https://random-aws-subdomain.region.es.amazonaws.com": RequestCanceled: request context canceled
caused by: context deadline exceeded: no Elasticsearch node available
2021-06-25T08:07:32.650+0100 [TRACE] vertex "elasticsearch_opendistro_roles_mapping.readall_and_monitor_global (expand)": visit complete
2021-06-25T08:07:32.650+0100 [TRACE] vertex "module.ops_cluster.aws_iam_instance_profile.workers_launch_template": entering dynamic subgraph
2021-06-25T08:07:32.650+0100 [TRACE] vertex "elasticsearch_opendistro_role.readall_and_monitor_global": visit complete
2021-06-25T08:07:32.650+0100 [TRACE] vertex "elasticsearch_opendistro_role.readall_and_monitor_global": dynamic subgraph encountered errors: health check timeout: Head "https://random-aws-subdomain.region.es.amazonaws.com": RequestCanceled: request context canceled
caused by: context deadline exceeded: no Elasticsearch node available
2021-06-25T08:07:32.650+0100 [TRACE] vertex "elasticsearch_opendistro_role.readall_and_monitor_global": visit complete
2021-06-25T08:07:32.650+0100 [TRACE] vertex "elasticsearch_opendistro_role.readall_and_monitor_global (expand)": dynamic subgraph encountered errors: health check timeout: Head "https://random-aws-subdomain.region.es.amazonaws.com": RequestCanceled: request context canceled
caused by: context deadline exceeded: no Elasticsearch node available
2021-06-25T08:07:32.651+0100 [TRACE] vertex "elasticsearch_opendistro_role.readall_and_monitor_global (expand)": visit complete

@timcosta

I've seen this a lot in the past few days, BUT it's always been a configuration issue on my end. Try applying your authn/authz terraform using -target; I think your SAML auth config or master role ARN is likely out of sync. If that isn't terraformed, go to Modify Authentication in the AWS console and fill out everything again. This has happened to me when I accidentally dropped the assumed role's permissions from ES.

tl;dr: the way it says "you don't have permission" is by saying "no Elasticsearch node available".

@raids
Author

raids commented Jun 25, 2021

I spent some of today stepping through all of this to try and get to the bottom of the issue, and I think aws_assume_role_arn in my provider config isn't being respected.

# this doesn't work
provider elasticsearch {
  url                 = "https://${aws_elasticsearch_domain.domain.endpoint}"
  sign_aws_requests   = true
  aws_assume_role_arn = "arn:aws:iam::account-id:role/my-deployment-role"
}

As mentioned, the above provider config gives me the no Elasticsearch node available error. However, if I manually assume the role using aws sts assume-role --role-arn arn:aws:iam::account-id:role/my-deployment-role --role-session-name session-name and then plug the returned credentials into the provider, as below, then I can run a terraform plan with no issue.

# this works
provider elasticsearch {
  url                 = "https://${aws_elasticsearch_domain.domain.endpoint}"
  sign_aws_requests   = true
  aws_access_key = "XXXXXXXXXXXXX"
  aws_secret_key = "XXXXXXXXXXXXX"
  aws_token = "XXXXXXXXXXXXX"
}
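
For anyone reproducing this outside Terraform: the error message suggests the provider's health check is a SigV4-signed HEAD request against the domain endpoint, so a minimal standalone sketch along the following lines can help separate credential problems from provider problems. The endpoint, region, and credentials are placeholders (plug in the values returned by aws sts assume-role).

package main

import (
	"fmt"
	"log"
	"net/http"
	"time"

	"github.com/aws/aws-sdk-go/aws/credentials"
	v4 "github.com/aws/aws-sdk-go/aws/signer/v4"
)

func main() {
	// Placeholder credentials: use the values returned by `aws sts assume-role`.
	creds := credentials.NewStaticCredentials("ACCESS_KEY", "SECRET_KEY", "SESSION_TOKEN")

	// A SigV4-signed HEAD against the domain endpoint, roughly what the
	// failing health check appears to do based on the error message.
	req, err := http.NewRequest(http.MethodHead, "https://random-aws-url.aws-region.es.amazonaws.com/", nil)
	if err != nil {
		log.Fatal(err)
	}
	if _, err := v4.NewSigner(creds).Sign(req, nil, "es", "aws-region", time.Now()); err != nil {
		log.Fatal(err)
	}

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status) // a 2xx status means the signed identity is accepted by the domain
}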

Perhaps this is similar to #124? I'm using provider version 1.5.7, so perhaps there's another case not accounted for.

@raids
Author

raids commented Jun 25, 2021

Also, it might be worthwhile to note that I'm using an SSO profile by doing AWS_PROFILE=sso-profile terraform plan:

[profile sso-profile]
sso_start_url = https://mycompany.awsapps.com/start/
sso_region = eu-west-1
sso_account_id = account-id
sso_role_name = role-name
region = eu-west-1
output = json

I used the same profile to assume the role with, so it definitely has the right permissions for that.

@phillbaker
Owner

Hi @raids, sorry to hear that this broke unexpectedly! Thanks for providing details and debugging.

I've recently upgraded from tf 0.13 to 1.0. Since upgrading and migrating the state, this provider doesn't seem able to authenticate with a cluster which it previously was managing

I'm not sure why it would matter, but would you be able to try downgrading terraform? If the terraform version change started the issue, perhaps reverting would fix it?

if I manually assume the role using aws sts assume-role --role-arn arn:aws:iam::account-id:role/my-deployment-role --role-session-name session-name and then plug the returned credentials into the provider, as below, then I can run a terraform plan with no issue.

This definitely points to an issue - have you tried specifying the profile via the aws_profile option in the provider config?

Can you confirm what file the profile information is stored in on disk?

@raids
Author

raids commented Jun 30, 2021

Hi @phillbaker - apologies, I haven't had much time to test this further. Some answers to your questions and then some good news(?) below.

I'm not sure why it would matter, but would you be able to try downgrading terraform? If the terraform version change started the issue, perhaps reverting would fix it?

I went ahead with the Terraform 1.0 upgrade as a priority, so running again on a downgraded version of Terraform isn't an easy option for me right now.

This definitely points to an issue - have you tried specifying the profile via the aws_profile option in the provider config?

We can't use aws_profile in our provider config as it's not aligned with how our CI deploys, so including it in the templates is not an option.

Can you confirm what file the profile information is stored in on disk?

The SSO profile information is stored in ~/.aws/config.

The good news is that I think I know where the issue stems from, and it's not from upgrading the provider version:

  • I didn't emphasize that when I migrated the state during the upgrade, it wasn't just from one version of Terraform to another; it was also to a different S3 bucket in a different account, which I must use an SSO profile/role to access.

  • After having a quick look at AWS Go SDK's support for SSO profiles, I found an AWS blog post with the following snippet:

      sess, err := session.NewSessionWithOptions(session.Options{
          SharedConfigState: session.SharedConfigEnable, // Must be set to enable
          Profile:           "dev-profile",
      })
    

    I tested that snippet locally (i.e. with AWS_PROFILE=profile-name go run main.go) and it worked as expected. Without SharedConfigState set, it would not work.

    I poked around a bit, and after seeing the comment description of the session.SharedConfigEnable const, I tried setting the env var AWS_SDK_LOAD_CONFIG=1 instead, which worked (a fuller sketch of this behaviour follows below). I imagine this would also work with my terraform plans, and I'll try to test that this week when I have some time. From what I can see, this provider isn't setting that option, so I'm not sure how or if you want to handle this. For now, I'm content with setting that environment variable, as it would only be needed for local dev interaction with Terraform, not our CI pipeline, which is a good old-fashioned IAM user assuming roles.
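
A minimal, self-contained sketch of the behaviour described above, assuming the AWS SDK for Go v1 (the profile name and role ARN are placeholders): with SharedConfigEnable the SDK reads the SSO profile from ~/.aws/config, after which the deployment role can be assumed on top of those credentials, mirroring what aws_assume_role_arn is expected to do.

package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/credentials/stscreds"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/sts"
)

func main() {
	// Without SharedConfigEnable (or AWS_SDK_LOAD_CONFIG=1), the SDK only reads
	// ~/.aws/credentials and never sees SSO profiles defined in ~/.aws/config.
	sess, err := session.NewSessionWithOptions(session.Options{
		SharedConfigState: session.SharedConfigEnable,
		Profile:           "sso-profile", // placeholder profile name
	})
	if err != nil {
		log.Fatal(err)
	}

	// Assume the deployment role on top of the SSO credentials, mirroring
	// what aws_assume_role_arn is expected to do.
	creds := stscreds.NewCredentials(sess, "arn:aws:iam::account-id:role/my-deployment-role")

	// Confirm which identity signed requests would actually use.
	out, err := sts.New(sess, &aws.Config{Credentials: creds}).GetCallerIdentity(&sts.GetCallerIdentityInput{})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(aws.StringValue(out.Arn))
}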

phillbaker changed the title from "Can't authenticate against AWS ElasticSearch cluster since upgrading" to "AWS SAML connection error on AWS ElasticSearch cluster since upgrading" on Jul 3, 2021
@phillbaker
Owner

Thanks @raids, the suggestion to set SharedConfigState: session.SharedConfigEnable looks very similar to what the terraform-aws-provider does (in its upstream auth library): hashicorp/aws-sdk-go-base#38. I'll add a commit to do that as it looks pretty straightforward, without downsides.
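
For context, a hypothetical sketch of what such a change might look like where the provider builds its AWS session (this is not the actual commit, and newSession is an invented name standing in for the provider's session setup):

package main

import (
	"github.com/aws/aws-sdk-go/aws/session"
)

// newSession is a hypothetical helper showing the idea: always opt in to
// shared config so profiles defined in ~/.aws/config (including SSO
// profiles) are honoured without requiring AWS_SDK_LOAD_CONFIG=1.
func newSession(profile string) (*session.Session, error) {
	opts := session.Options{
		SharedConfigState: session.SharedConfigEnable, // read ~/.aws/config as well as ~/.aws/credentials
	}
	if profile != "" { // stands in for the provider's aws_profile setting
		opts.Profile = profile
	}
	return session.NewSessionWithOptions(opts)
}

func main() {
	if _, err := newSession(""); err != nil {
		panic(err)
	}
}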

@phillbaker
Owner

Done in 72475b5
