
feat(aws assumerolewithwebidentity): fixed s3 access for ruler to use… #4738

Closed
wants to merge 1 commit

Conversation


blut commented May 11, 2022

fixed s3 access for ruler to use assumerolewithwebidentity in an IRSA setup on AWS

This PR includes code to use AssumeRoleWithWebIdentity and the standard environment variables to enable IRSA.
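For context, a minimal sketch (not the exact diff in this PR) of how AssumeRoleWithWebIdentity can be wired up from the standard IRSA environment variables with aws-sdk-go v1; the bucket, region, and role session name below are only illustrative:

```go
// Illustrative only: exchange the projected web-identity token for temporary
// credentials via STS, then hand those credentials to the S3 client.
package main

import (
	"fmt"
	"os"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/credentials/stscreds"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	// Injected by EKS when the service account is annotated for IRSA.
	roleARN := os.Getenv("AWS_ROLE_ARN")
	tokenFile := os.Getenv("AWS_WEB_IDENTITY_TOKEN_FILE")

	sess := session.Must(session.NewSession())
	creds := stscreds.NewWebIdentityCredentials(sess, roleARN, "cortex-ruler", tokenFile)

	client := s3.New(sess, aws.NewConfig().
		WithRegion("eu-central-1").
		WithCredentials(creds))

	out, err := client.ListObjectsV2(&s3.ListObjectsV2Input{
		Bucket: aws.String("cortex-storage-example"),
	})
	fmt.Println(out, err)
}
```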

Which issue(s) this PR fixes:
Fixes #3740

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

… IRSA for assumerolewithwebidentity

Signed-off-by: Hannes Blut <hannes.blut-extern@deutschebahn.com>
(external expert on behalf of DB Netz AG)
@alvinlin123
Contributor

The default credential resolver should support web identity already; I don't think we need to implement assume role with web identity manually here.
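For reference, a minimal sketch of what I mean, assuming aws-sdk-go v1: with AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE set (which IRSA does), a plain session should already resolve web-identity credentials without any manual AssumeRoleWithWebIdentity wiring:

```go
// Sketch: the default session credential resolution in aws-sdk-go v1 picks up
// the IRSA env vars (AWS_ROLE_ARN, AWS_WEB_IDENTITY_TOKEN_FILE) on its own.
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws/session"
)

func main() {
	sess := session.Must(session.NewSession())

	// Forces credential resolution; under IRSA this should come back from the
	// built-in web-identity provider, with no explicit stscreds code needed.
	creds, err := sess.Config.Credentials.Get()
	fmt.Println(creds.ProviderName, err)
}
```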

@blut
Author

blut commented May 12, 2022

@alvinlin123, the AWS client SDK implements AssumeRoleWithWebIdentity correctly; however, as far as I could tell,
for the ruler and the alertmanager the STS AssumeRoleWithWebIdentity calls are issued against the S3 endpoint instead of the STS endpoint.
Might this be due to an incomplete configuration on client initialization?

@alvinlin123
Contributor

alvinlin123 commented May 12, 2022

Hmm, this is weird, because we run the alertmanager and ruler using IRSA too, without any issue. We might need to dig deeper into what is happening for you. Do you have the latest error message?

Most likely you are right: the client initialization may be incomplete, or there may be some other env var in play here. Would it be possible to do a build with debug logging turned on for the session and see what's going on?

I will do some code reading in the meanwhile.
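For what it's worth, a rough sketch of what I mean by SDK-level debug logging, assuming aws-sdk-go v1 (the STS call here is only there to trigger some traffic):

```go
// Sketch: turn on verbose request/response logging for everything built from
// this session, which shows which host the AssumeRoleWithWebIdentity call hits.
package main

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/sts"
)

func main() {
	sess := session.Must(session.NewSession(aws.NewConfig().
		WithLogLevel(aws.LogDebugWithHTTPBody).   // dump signed requests and responses
		WithCredentialsChainVerboseErrors(true))) // more detail when credential lookup fails

	svc := sts.New(sess)
	_, _ = svc.GetCallerIdentity(&sts.GetCallerIdentityInput{})
}
```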

@blut
Author

blut commented May 12, 2022

I'll add a detailed bug report with debug logging tomorrow.

@alvinlin123
Contributor

alvinlin123 commented May 13, 2022

@blut also, if you can post your alertmanager/ruler config (including the s3 client) it may help me troubleshoot :)

Also, do you know if the environment you are running in allows the global STS endpoint (https://sts.amazonaws.com)? I had some customers hitting weird issues because their firewall/proxy doesn't allow the global STS endpoint. Would setting the env variable AWS_STS_REGIONAL_ENDPOINTS=regional be something you can test as well? I am scratching my head because I just double-checked that my alertmanager and ruler environment is using IRSA and is not having any issues.

And it's not that I don't want to merge this PR; I am more worried that the AWS SDK has a bug or something. That's why I appreciate your help on this :-)
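For reference, the env variable above also has a programmatic counterpart in aws-sdk-go v1, in case that is easier to test from a patched build (illustrative, not Cortex code):

```go
// Sketch: prefer the regional STS endpoint (sts.eu-central-1.amazonaws.com)
// over the global one (sts.amazonaws.com); the in-code equivalent of
// AWS_STS_REGIONAL_ENDPOINTS=regional.
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/endpoints"
	"github.com/aws/aws-sdk-go/aws/session"
)

func main() {
	cfg := aws.NewConfig().WithRegion("eu-central-1")
	cfg.STSRegionalEndpoint = endpoints.RegionalSTSEndpoint

	sess := session.Must(session.NewSession(cfg))
	fmt.Println(*sess.Config.Region)
}
```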

@blut
Author

blut commented Jun 10, 2022

Hi @alvinlin123
I've attached the debug.log, but I think the interesting message is the following error:

```
caused by: SerializationError: failed to unmarshal error message
        status code: 405, request id:
caused by: UnmarshalError: failed to unmarshal error message
        00000000  3c 3f 78 6d 6c 20 76 65  72 73 69 6f 6e 3d 22 31  |<?xml version=\"1|
00000010  2e 30 22 20 65 6e 63 6f  64 69 6e 67 3d 22 55 54  |.0\" encoding=\"UT|
00000020  46 2d 38 22 3f 3e 0a 3c  45 72 72 6f 72 3e 3c 43  |F-8\"?>.<Error><C|
00000030  6f 64 65 3e 4d 65 74 68  6f 64 4e 6f 74 41 6c 6c  |ode>MethodNotAll|
00000040  6f 77 65 64 3c 2f 43 6f  64 65 3e 3c 4d 65 73 73  |owed</Code><Mess|
00000050  61 67 65 3e 54 68 65 20  73 70 65 63 69 66 69 65  |age>The specifie|
00000060  64 20 6d 65 74 68 6f 64  20 69 73 20 6e 6f 74 20  |d method is not |
00000070  61 6c 6c 6f 77 65 64 20  61 67 61 69 6e 73 74 20  |allowed against |
00000080  74 68 69 73 20 72 65 73  6f 75 72 63 65 2e 3c 2f  |this resource.</|
00000090  4d 65 73 73 61 67 65 3e  3c 4d 65 74 68 6f 64 3e  |Message><Method>|
000000a0  50 4f 53 54 3c 2f 4d 65  74 68 6f 64 3e 3c 52 65  |POST</Method><Re|
000000b0  73 6f 75 72 63 65 54 79  70 65 3e 53 45 52 56 49  |sourceType>SERVI|
000000c0  43 45 3c 2f 52 65 73 6f  75 72 63 65 54 79 70 65  |CE</ResourceType|
000000d0  3e 3c 52 65 71 75 65 73  74 49 64 3e 59 4a 33 42  |><RequestId>YJ3B|
000000e0  43 37 4a 4a 47 56 34 37  45 45 59 45 3c 2f 52 65  |C7JJGV47EEYE</Re|
000000f0  71 75 65 73 74 49 64 3e  3c 48 6f 73 74 49 64 3e  |questId><HostId>|
00000100  55 55 77 71 55 70 51 54  74 6c 44 67 35 54 7a 2f  |UUwqUpQTtlDg5Tz/|
00000110  7a 55 42 57 2b 79 73 4f  55 36 75 67 53 2f 4d 6d  |zUBW+ysOU6ugS/Mm|
00000120  4e 2b 45 32 52 62 56 66  66 4b 47 72 56 65 31 5a  |N+E2RbVffKGrVe1Z|
00000130  7a 76 49 51 77 35 34 34  32 4f 4d 4f 47 77 37 73  |zvIQw5442OMOGw7s|
00000140  6c 2f 44 45 70 39 61 38  55 53 30 3d 3c 2f 48 6f  |l/DEp9a8US0=</Ho|
00000150  73 74 49 64 3e 3c 2f 45  72 72 6f 72 3e           |stId></Error>|

caused by: unknown error response tag, {{ Error} []}
```

This error happened with the following configuration:

```
      - args:
        - -log.level=debug
        - -api.response-compression-enabled=true
        - -blocks-storage.backend=s3
        - -blocks-storage.s3.bucket-name=cortex-storage-uash1kei
        - -blocks-storage.s3.endpoint=s3.eu-central-1.amazonaws.com
        - -consul.hostname=
        - -distributor.health-check-ingesters=true
        - -distributor.replication-factor=3
        - -distributor.shard-by-all-labels=true
        - -dynamodb.api-limit=10
        - -dynamodb.url=https://eu-central-1
        - -experimental.ruler.enable-api=true
        - -memberlist.abort-if-join-fails=false
        - -memberlist.bind-port=7946
        - -memberlist.join=gossip-ring.cortex.svc.cluster.local:7946
        - -querier.query-ingesters-within=13h
        - -querier.query-store-after=12h
        - -querier.store-gateway-addresses=store-gateway:9095
        - -ring.heartbeat-timeout=10m
        - -ring.prefix=
        - -ring.store=memberlist
        - -ruler-storage.backend=s3
        - -ruler-storage.s3.bucket-name=cortex-storage-uash1kei
        - -ruler-storage.s3.region=eu-central-1
        - -ruler.alertmanager-url=http://alertmanager.cortex.svc.cluster.local/alertmanager
        - -ruler.enable-sharding=true
        - -ruler.max-rule-groups-per-tenant=20
        - -ruler.max-rules-per-rule-group=15
        - -ruler.ring.consul.hostname=
        - -ruler.ring.store=memberlist
        - -ruler.storage.s3.buckets=cortex-storage-uash1kei
        - -ruler.storage.s3.endpoint=s3.eu-central-1.amazonaws.com
        - -consul.hostname=
        - -ruler.storage.s3.force-path-style=false
        - -ruler.storage.s3.region=eu-central-1
        - -ruler.storage.type=s3
        - -runtime-config.file=/etc/cortex/overrides.yaml
        - -s3.url=https://eu-central-1/cortex-storage-uash1kei
        - -schema-config-file=/etc/cortex/schema/config.yaml
        - -store.cardinality-limit=1000000
        - -store.engine=blocks
        - -store.max-query-length=768h
        - -target=ruler
        env:
        - name: AWS_STS_REGIONAL_ENDPOINT
          value: regional
```

As seen in the attached pod.yaml, the required AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE are also set through EKS.
I've added the AWS_STS_REGIONAL_ENDPOINT as suggested.

Edit: The firewall should not be an issue, since the ruler and all the other cortex components are deployed to the same cluster & nodes. The cortex components also share the same serviceaccount.


@alvinlin123
Contributor

Thank you for getting back. I will take a closer look ASAP :)

@blut
Author

blut commented Jun 23, 2022

> Thank you for getting back. I will take a closer look ASAP :)

Hi @alvinlin123, did you get a chance to look at my configuration?

@alvinlin123
Contributor

@blut I'll take a look today. I forgot to ask: which commit/version of Cortex are you using?

@alvinlin123
Contributor

@blut I think I know what's going on. Can you remove the -ruler.storage.s3.endpoint=s3.eu-central-1.amazonaws.com config?

That config results in the AWS SDK's WithEndpoint method being called, and I vaguely remember that the endpoint set by WithEndpoint is used for all calls, including calls to STS. This would explain why you are seeing an error message from S3 when calling STS with WebIdentity.
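A rough sketch of the behaviour I mean, assuming aws-sdk-go v1 (this is an illustration, not the Cortex source):

```go
// Sketch: an Endpoint set on the shared config is inherited by every client
// built from that session, so the STS call behind AssumeRoleWithWebIdentity
// ends up POSTing to the S3 endpoint, which answers 405 MethodNotAllowed.
package main

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
	"github.com/aws/aws-sdk-go/service/sts"
)

func main() {
	cfg := aws.NewConfig().
		WithRegion("eu-central-1").
		WithEndpoint("s3.eu-central-1.amazonaws.com") // meant for S3 only

	sess := session.Must(session.NewSession(cfg))

	_ = s3.New(sess)  // talks to s3.eu-central-1.amazonaws.com, as intended
	_ = sts.New(sess) // also talks to s3.eu-central-1.amazonaws.com -> 405

	// Scoping the override to the S3 client (or dropping it) lets STS resolve
	// its own regional endpoint instead.
	_ = sts.New(sess, aws.NewConfig().WithEndpoint("https://sts.eu-central-1.amazonaws.com"))
}
```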

@blut
Author

blut commented Jun 28, 2022

We're still on Cortex v1.9.0, deployed to Kubernetes using Tanka.
For the ruler deployment alone I've tried upgrading the image to v1.11.1, and with ruler.storage.s3.endpoint removed I get a much funnier error:

```
level=error ts=2022-06-28T10:05:40.240948562Z caller=ruler.go:481 msg="unable to list rules" err="WebIdentityErr: failed to retrieve credentials\ncaused by: RequestError: send request failed\ncaused by: Post \"https://sts.dummy.amazonaws.com/\": dial tcp: lookup sts.dummy.amazonaws.com on 172.20.0.10:53: no such host"
```

It appears the region is defined somewhere separately; the SDK is building the regional STS host with the literal region "dummy".
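A rough illustration of where that hostname could come from, assuming this is how the SDK resolves regional STS endpoints (not Cortex code):

```go
// Sketch: with regional STS endpoints, the host is derived from the region on
// the client config, so a placeholder region like "dummy" produces a host that
// does not exist in DNS.
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws/endpoints"
)

func main() {
	for _, region := range []string{"dummy", "eu-central-1"} {
		ep, err := endpoints.DefaultResolver().EndpointFor(
			"sts", region,
			endpoints.STSRegionalEndpointOption, // same effect as AWS_STS_REGIONAL_ENDPOINTS=regional
		)
		fmt.Println(region, ep.URL, err) // e.g. https://sts.dummy.amazonaws.com vs https://sts.eu-central-1.amazonaws.com
	}
}
```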

@stale

stale bot commented Oct 1, 2022

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Oct 1, 2022
@alvinlin123
Contributor

@blut do you still have the same issue?

@alvinlin123 alvinlin123 removed the stale label Oct 1, 2022
@blut
Author

blut commented Oct 1, 2022

Hi @alvinlin123, we've switched to Mimir, where this issue is resolved. Feel free to close.

@blut blut closed this Oct 1, 2022