Windows Pods not resolving DNS names #5099

Closed
marciogmorales opened this issue Nov 15, 2023 · 3 comments · Fixed by #5132
Labels
bug Something isn't working burning Time sensitive issues

Comments

marciogmorales commented Nov 15, 2023

Description

Observed Behavior:

Windows pods aren't resolving DNS names, regardless of whether the node runs Windows Server 2019 or 2022, whereas Linux pods resolve them fine. I narrowed the issue down: it only occurs when Windows nodes are deployed via Karpenter, and it does not happen when nodes are deployed via EKS Node Groups. Judging by the nslookup test, the Windows pod can't reach the DNS server (CoreDNS). I believe it has something to do with the EKS bootstrap on the Windows node, but I wasn't able to find the root cause.

NSLOOKUP from a Linux Pod (DNS server is found)

PS C:\kube\Apps> kubectl exec -i -t dnsutils -- nslookup www.google.com
Server:         10.100.0.10
Address:        10.100.0.10#53

Non-authoritative answer:
Name:   www.google.com
Address: 142.251.167.147
Name:   www.google.com
Address: 142.251.167.104
Name:   www.google.com
Address: 142.251.167.99
Name:   www.google.com
Address: 142.251.167.106
Name:   www.google.com
Address: 142.251.167.105
Name:   www.google.com
Address: 142.251.167.103

NSLOOKUP from a Windows Pod (DNS server is not found)

PS C:\kube\Apps> kubectl exec -i -t amazon-eks-gmsa-domainless-2022-699fbdf69d-rtdlr -- nslookup www.google.com
Server:  UnKnown
Address:  10.100.0.10

DNS request timed out.
    timeout was 2 seconds.
Server:  UnKnown
Address:  10.100.0.10

DNS request timed out.
    timeout was 2 seconds.
DNS request timed out.
    timeout was 2 seconds.
DNS request timed out.
    timeout was 2 seconds.
DNS request timed out.
    timeout was 2 seconds.
*** Request to UnKnown timed-out
PS C:\kube\Apps> 

Expected Behavior:

Windows pods resolving DNS names.

Reproduction Steps (Please include YAML):

1 - Deploy an EKS cluster and Karpenter following the Karpenter docs, deploy the NodePool and EC2NodeClass below, and run nslookup from a Windows pod.

---
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: windows2022
  annotations:
    kubernetes.io/description: "General purpose NodePool for Windows workloads"
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/os
          operator: In
          values: ["windows"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["2"]
      nodeClassRef:
        name: windows2022
      taints:
      - key: os/windows2022
        effect: NoSchedule
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 720h # 30 * 24h = 720h
---
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: windows2022
  annotations:
    kubernetes.io/description: "Nodes running Windows Server 2022"
spec:
  amiFamily: Windows2022
  role: "KarpenterNodeRole-eks-cluster" # replace with your cluster name
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "eks-cluster" # replace with your cluster name
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "eks-cluster" # replace with your cluster name
  metadataOptions:
    httpProtocolIPv6: disabled
    httpTokens: required
  detailedMonitoring: true

Versions:

  • EKS: 1.28
  • AMI: Windows_Server-2019-English-Core-EKS_Optimized-1.28-2023.10.19
  • AMI: Windows_Server-2022-English-Core-EKS_Optimized-1.28-2023.10.19
  • Chart Version:
  • Kubernetes Version (kubectl version): 1.28.3
@marciogmorales marciogmorales added the bug Something isn't working label Nov 15, 2023
Contributor

tzneal commented Nov 16, 2023

Setting the service CIDR should resolve this as identified in #4088:

apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
...
spec:
  userData: |
    [System.Environment]::SetEnvironmentVariable('SERVICE_IPV4_CIDR', '10.100.0.0/16', 'Machine')
...

Author

marciogmorales commented

I already set the service CIDR and the issue is the same. On v0.32 the userData has to go on the NodeClass (EC2NodeClass) rather than on an AWSNodeTemplate.
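For reference, a minimal sketch of what that setting might look like on a v1beta1 EC2NodeClass, reusing the names from the reproduction above. The CIDR `10.100.0.0/16` is taken from the earlier suggestion and is assumed to match this cluster's service CIDR. According to this thread, the setting alone did not fix DNS resolution, so treat it as an illustration of where the field lives rather than as the fix.

```yaml
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: windows2022
spec:
  amiFamily: Windows2022
  role: "KarpenterNodeRole-eks-cluster" # replace with your cluster name
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "eks-cluster" # replace with your cluster name
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "eks-cluster" # replace with your cluster name
  # PowerShell run at node startup; the CIDR is an assumption and must
  # match the service CIDR of your own cluster.
  userData: |
    [System.Environment]::SetEnvironmentVariable('SERVICE_IPV4_CIDR', '10.100.0.0/16', 'Machine')
```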

@jonathan-innis jonathan-innis added the burning Time sensitive issues label Nov 21, 2023
Contributor

tzifudzi commented Nov 21, 2023

Hello Marcio, thanks for bringing attention to this.

  • The root cause is that when a Windows node is created with an instance profile whose IAM role is not mapped to the RBAC group eks:kube-proxy-windows, the node does not have the RBAC permissions needed to fetch the cluster's services and endpoints.
  • As a result, the kube-proxy binary running on the Windows node is not aware of the services in the cluster and never configures the network rules for service access. DNS resolution then fails, because pods resolve names through the kube-dns service that fronts CoreDNS.
  • To fix this, specify the RBAC group mapping eks:kube-proxy-windows when setting up the Karpenter IAM role and instance profile used by the Windows nodes. I will update the docs in both Karpenter and EKS to be more specific about this, as it was previously not mentioned, and will link the docs change PR when it is ready.
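The mapping described above is typically expressed in the `aws-auth` ConfigMap in `kube-system`. A minimal sketch follows, assuming the node role is named `KarpenterNodeRole-eks-cluster` as in the reproduction; the account ID `111122223333` is a placeholder. Merge the entry into your existing `mapRoles` list rather than applying this file as-is, since replacing the ConfigMap would drop any other role mappings.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
data:
  mapRoles: |
    # Placeholder account ID; role name assumed from the reproduction above.
    - rolearn: arn:aws:iam::111122223333:role/KarpenterNodeRole-eks-cluster
      username: system:node:{{EC2PrivateDNSName}}
      groups:
        - system:bootstrappers
        - system:nodes
        # Lets kube-proxy on Windows nodes read services and endpoints.
        - eks:kube-proxy-windows
```

The entry may take effect only for nodes that join, or kube-proxy processes that start, after the change, so recycling the affected Windows nodes and re-running the nslookup test is a reasonable way to verify it.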
