[publick8s] AzureAD / AKS error Authorization Failures have been detected that may affect cluster availability over outbound IPv6 addresses #4206

Closed
dduportal opened this issue Jul 31, 2024 · 6 comments

Comments

@dduportal
Contributor

Service(s)

Azure, Other

Summary

The Azure Health Monitor (in Azure UI, browse to the AKS cluster, select 'Diagnose and solve problems' -> section 'Cluster and Control Plane Availability and Performance') alerts us about the following error in the cluster publick8s:

The client '4bc84fd5-6d28-415b-b40e-ba1466c05f40' with object id '4bc84fd5-6d28-415b-b40e-ba1466c05f40' does not have authorization to perform action 'Microsoft.Network/publicIPAddresses/write' over scope '/subscriptions/dff2ec18-6a8e-405c-8e45-b7df7465acf0/resourceGroups/prod-public-ips/providers/Microsoft.Network/publicIPAddresses/kubernetes-a87019483466d4648b9aa56d99ea6330-IPv6' or the scope is invalid. If access was recently granted, please refresh your credentials. 
The client '4bc84fd5-6d28-415b-b40e-ba1466c05f40' with object id '4bc84fd5-6d28-415b-b40e-ba1466c05f40' does not have authorization to perform action 'Microsoft.Network/publicIPAddresses/write' over scope '/subscriptions/dff2ec18-6a8e-405c-8e45-b7df7465acf0/resourceGroups/prod-public-ips/providers/Microsoft.Network/publicIPAddresses/kubernetes-afe471e76397e44338e1a642f52be222-IPv6' or the scope is invalid. If access was recently granted, please refresh your credentials. 

We need to check:

  • Do we still need these 2 outbound IPv6 addresses?
    • If no, then can we disable them?
      • If no, then we need to fix the permissions
      • If yes, then we have to disable/remove these IPs
    • If yes, then we need to fix the permissions
Reproduction steps

No response

@dduportal dduportal added the triage Incoming issues that need review label Jul 31, 2024
@dduportal dduportal added this to the infra-team-sync-2024-08-13 milestone Jul 31, 2024
@dduportal dduportal self-assigned this Jul 31, 2024
@dduportal dduportal removed the triage Incoming issues that need review label Jul 31, 2024
@dduportal
Contributor Author

After a quick analysis: it appears that the custom resource group we specify for the public IPs of our public LBs (ref. https://github.com/jenkins-infra/kubernetes-management/blob/4f2bc39e9554bf45463e32f1c1dc69f7aa03539f/config/public-nginx-ingress_publick8s.yaml#L19) is also applied to the outbound IPs of the load balancer (see https://github.com/jenkins-infra/azure/blob/e8cbf0146eb1a033a7c93c65a5c7875570f3d6f3/publick8s.tf#L68-L73) when it is used as the egress/outbound method.

To solve these errors, we either:

@dduportal
Contributor Author

After some more investigation, it appears that these 2 IPv6 addresses are leftovers from past SNAT issues: #3908 (comment)

Since we changed the cluster setup to define explicit outbound IPv6 addresses in jenkins-infra/azure@f3ba3d8, these 2 old IPv6 addresses are no longer required.

WiP: deletion of these 2 IPv6 addresses

@dduportal
Contributor Author

The 2 IPv6 addresses have been disassociated from the publick8s LB and removed.

@dduportal
Contributor Author

dduportal commented Aug 6, 2024

Note: the 2 faulty IPv6 addresses have been removed, but it looks like a reconciliation still needs to happen.
As such, I've opened jenkins-infra/azure#795 and we'll wait 24 hours.

Sounds like a weird bug in AKS :|

dduportal added a commit to jenkins-infra/azure that referenced this issue Aug 6, 2024
Related to jenkins-infra/helpdesk#4206

This PR allows the cluster to fully manage IPs in this RG.

Signed-off-by: Damien Duportal <damien.duportal@gmail.com>
@dduportal
Contributor Author

dduportal commented Aug 6, 2024

Update: it is NOT an Azure issue. It is a configuration issue on our side: we were using DualStack LB Kubernetes services, which caused this unexpected behavior...

Ref. https://kubernetes.io/docs/concepts/services-networking/dual-stack/
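
For illustration, a minimal sketch of how such services could be spotted: it scans Service manifests (loaded as plain dictionaries) and lists the `LoadBalancer` ones whose `spec.ipFamilyPolicy` is anything other than `SingleStack`. The field names and the `SingleStack` default come from the Kubernetes dual-stack documentation linked above; the helper name and the sample service names are hypothetical and do not reflect the real publick8s manifests.

```python
def dual_stack_load_balancers(services):
    """Return names of LoadBalancer Services that request dual-stack.

    `services` is a list of Service manifests as dictionaries.
    Kubernetes treats a missing ipFamilyPolicy as SingleStack.
    """
    flagged = []
    for svc in services:
        spec = svc.get("spec", {})
        if spec.get("type") != "LoadBalancer":
            continue  # only LoadBalancer Services get cloud public IPs
        policy = spec.get("ipFamilyPolicy", "SingleStack")
        if policy != "SingleStack":
            flagged.append(svc["metadata"]["name"])
    return flagged


# Hypothetical manifests, for illustration only.
services = [
    {
        "metadata": {"name": "example-ingress"},
        "spec": {
            "type": "LoadBalancer",
            "ipFamilyPolicy": "PreferDualStack",
            "ipFamilies": ["IPv4", "IPv6"],
        },
    },
    {
        "metadata": {"name": "example-ingress-ipv6"},
        "spec": {
            "type": "LoadBalancer",
            "ipFamilyPolicy": "SingleStack",
            "ipFamilies": ["IPv6"],
        },
    },
    {"metadata": {"name": "example-internal"}, "spec": {"type": "ClusterIP"}},
]

print(dual_stack_load_balancers(services))
```

A service flagged this way could then be split into one SingleStack service per IP family, which is one possible reading of the "SingleStack fixes" mentioned in the next comment.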

@dduportal
Contributor Author

Update: with the SingleStack fixes, no more errors \o/

[Screenshot, 2024-08-06 at 19:28:40]
