
Pod to service connectivity issues on August and September cumulative updates on Windows Server 2019 #61

Closed
daschott opened this issue Oct 6, 2020 · 23 comments

@daschott

daschott commented Oct 6, 2020

In Kubernetes on Windows Server, the DECAP_IN VFP layer gets dropped on the 8C and 9B cumulative updates for Windows Server 2019 when the HNS service is restarted. This can cause pod -> service traffic to fail in some cases and configurations.

If a Windows Server 2019 machine needs to be restarted, a workaround to try is:

pause kube-proxy 
Get-HNSPolicyList | Remove-HNSPolicyList
<restart>

This regression will be resolved in the cumulative update being released in the third week of October. This issue does not surface on Windows Server, version 1903 and above.
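
A minimal PowerShell sketch of the workaround above, assuming kube-proxy runs as a Windows service named kube-proxy and that the HNS helper module (hns.psm1 from microsoft/SDN) is available - both are assumptions, so adjust for your setup:

Import-Module C:\k\hns.psm1                  # hypothetical path to the HNS helper module
Stop-Service kube-proxy                      # "pause" kube-proxy so it cannot re-create policies
Get-HNSPolicyList | Remove-HNSPolicyList     # clear the stale HNS (load balancer) policy lists
Restart-Computer                             # after the reboot, kube-proxy re-creates the policies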

@ghost ghost added the triage (New and needs attention) label Oct 6, 2020
@vrapolinario vrapolinario added the bug (Something isn't working) and Networking (Connectivity and network infrastructure) labels and removed the triage (New and needs attention) label Oct 6, 2020
@daschott
Author

The ETA for the fix is next week. Impacted users who need a (test-signed!) fix ahead of next week should reach out to their Microsoft customer support contact to request one.

@llyons

llyons commented Oct 13, 2020

When you say

pause kube-proxy

what do you mean there? pause just puts up a prompt to press any key to continue.

Do you mean reboot the system?

@daschott
Author

What I mean is to stop kube-proxy, so that no HNS policies get re-created by an actively running kube-proxy after you run Get-HNSPolicyList | Remove-HNSPolicyList.
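
For example, on a node where kube-proxy is registered as a Windows service (an assumption - adjust if you run it under nssm or as a plain process):

Stop-Service kube-proxy                       # or: Stop-Process -Name kube-proxy -Force
Get-HNSPolicyList | Remove-HNSPolicyList      # requires the hns.psm1 helper module to be imported
(Get-HNSPolicyList | Measure-Object).Count    # should stay at 0 while kube-proxy is stopped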

@vitaliy-leschenko

KB4577668 has been released on October 13, 2020 (OS Build 17763.1518).
Does it fix the issue?

@daschott
Author

No, that build still has the issue. The next KB should have the fix; the ETA is October 20th.

@vitaliy-leschenko

Hi again,
KB4580390 has been released on October 20, 2020 (OS Build 17763.1554).
Does it fix the issue?

On the page (https://support.microsoft.com/en-us/help/4580390/windows-10-update-kb4580390) it is marked as Preview. Is it the final KB, or do we need to wait for another one?

@daschott
Author

@vitaliy-leschenko

For the "preview" naming please see here:
https://techcommunity.microsoft.com/t5/windows-it-pro-blog/resuming-optional-windows-10-and-windows-server-non-security/ba-p/1471429

It is the final production KB, but it is optional, meaning users need to seek it out. It will also be included in next month's "B" release update, hence the name "preview"...

This KB contains the fix for this issue. Can you try it out and confirm whether the issue still reproduces or not?

@vitaliy-leschenko

vitaliy-leschenko commented Oct 22, 2020

I confirm that my nodes that were updated to 10.0.17763.1554 no longer have the pod-to-service connectivity issue.
All my nodes were updated manually from 10.0.17763.1294 by installing the MSU (http://download.windowsupdate.com/c/msdownload/update/software/updt/2020/10/windows10.0-kb4580390-x64_743bc31f33bf399c7f15ab020df685780faf4cb5.msu).
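
In case it helps anyone else verify the patch level, a quick check on each node is something like:

# Shows the build and update revision, e.g. 17763 and 1554.
Get-ItemProperty "HKLM:\SOFTWARE\Microsoft\Windows NT\CurrentVersion" | Select-Object CurrentBuild, UBR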

@vitaliy-leschenko

vitaliy-leschenko commented Nov 27, 2020

I am sorry about my previous comment, but the issue does still exist.
Windows pods can communicate with a service only when the request lands on a backend pod on the same node.
When it is load-balanced to another node, it fails.

More details: kubernetes-sigs/sig-windows-tools#127

@vitaliy-leschenko

@immuzz @daschott can you reopen the issue?

@vitaliy-leschenko

Build 17763.1577 also doesn't work.

@daschott daschott reopened this Dec 5, 2020
@daschott
Author

daschott commented Dec 5, 2020

@vitaliy-leschenko Can you please try out some of the troubleshooting steps here, and give us the output of the collectlogs.ps1 script:
https://techcommunity.microsoft.com/t5/networking-blog/troubleshooting-kubernetes-networking-on-windows-part-1/ba-p/508648

Specifically, can you reproduce the output shown in example #1 and example #3 in the above doc? Do all the Windows nodes have the same patch level, 17763.1577?
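
If it helps, one way to fetch and run the script from the microsoft/SDN repo is roughly:

Invoke-WebRequest -UseBasicParsing -OutFile collectlogs.ps1 `
    -Uri "https://raw.githubusercontent.com/microsoft/SDN/master/Kubernetes/windows/debug/collectlogs.ps1"
.\collectlogs.ps1    # dumps HNS/VFP/network state into a local folder that can be zipped and shared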

@vitaliy-leschenko

OK, I will try.

@vitaliy-leschenko

I tested Windows Server 1809 with the latest updates installed (version 10.0.17763.1637). It still has the pod-to-service connectivity issue.

curl -LO https://raw.githubusercontent.com/kubernetes-sigs/sig-windows-tools/master/hack/DebugWindowsNode.ps1
[screenshot of DebugWindowsNode.ps1 output omitted]

Can't test .\ValidateKubernetes.Pester.tests.ps1 because:

  • it is designed for an old k8s version
  • I use flannel in overlay mode instead of l2bridge

Currently we have this situation:

  1. Windows pods have internet access
  2. Windows pods have access to k8s DNS
  3. Windows pods can communicate with services only if the backing pods are hosted on the same node
  4. Windows pods can communicate with any pod by podIP

I created a sample to reproduce the issue: https://vitaliyorgstorage.azureedge.net/9d4196ab-1c3c-4efb-b065-b643384f1832/github/sample.yaml
It contains the namespace issues with 2 services: iis (for Windows nodes) and nginx (for Linux nodes).
Then I connected to the iis pods and tried:

  • curl -Ikv iis.issues.svc.cluster.local - sometimes OK (if the load balancer routes the request to the same node), otherwise it fails.
  • curl -Ikv nginx.issues.svc.cluster.local - always OK
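
As a rough way to see the intermittent behavior from inside the cluster (the app=iis label selector and curl.exe inside the container are assumptions based on the sample), repeating the request from one iis pod shows the per-node pattern:

# Pick one of the Windows pods and hit the service several times; with the issue
# present, only requests that land on a backend pod on the same node succeed.
$pod = kubectl -n issues get pods -l app=iis -o jsonpath='{.items[0].metadata.name}'
1..10 | ForEach-Object {
    kubectl -n issues exec $pod -- curl.exe -Iks --max-time 5 http://iis.issues.svc.cluster.local
}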

@vitaliy-leschenko

Logs for https://raw.githubusercontent.com/microsoft/SDN/master/Kubernetes/windows/debug/collectlogs.ps1 can be downloaded from: https://vitaliyorgstorage.azureedge.net/9d4196ab-1c3c-4efb-b065-b643384f1832/github/2ybmbpfj.354.zip

@daschott
Author

Sorry for the delay here.

What do the logs look like from the problematic node? Is FlannelD running on the problematic node? Also, what do the FlannelD logs look like on the problematic node? There should be entries such as

Subnet added:  10.244.1.0/24 via 192.168.99.110
...
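
For example, assuming flanneld's output is redirected to a file such as C:\flannel\flanneld.log (a hypothetical path - the exact location depends on how it is launched):

# Adjust the path to wherever flanneld's stdout/stderr is captured.
Select-String -Path C:\flannel\flanneld.log -Pattern "Subnet added"

Each remote node's pod subnet should show up with its own entry.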

@vitaliy-leschenko

FlannelD is running. Pods can communicate via IP addresses.

Here is a flannel log from another cluster with the same issue:

I0102 12:48:46.720161    7476 main.go:518] Determining IP address of default interface
I0102 12:48:46.897166    7476 main.go:531] Using interface with name vEthernet (Ethernet) 2 and address 192.168.191.174
I0102 12:48:46.897166    7476 main.go:548] Defaulting external address to interface address (192.168.191.174)
I0102 12:48:46.909163    7476 kube.go:119] Waiting 10m0s for node controller to sync
I0102 12:48:46.909163    7476 kube.go:306] Starting kube subnet manager
I0102 12:48:47.909266    7476 kube.go:126] Node controller sync successful
I0102 12:48:47.909266    7476 main.go:246] Created subnet manager: Kubernetes Subnet Manager - pq-node01
I0102 12:48:47.909266    7476 main.go:249] Installing signal handlers
I0102 12:48:47.909266    7476 main.go:390] Found network config - Backend type: vxlan
I0102 12:48:47.909266    7476 vxlan_windows.go:127] VXLAN config: Name=flannel.4096 MacPrefix=0E-2A VNI=4096 Port=4789 GBP=false DirectRouting=false
I0102 12:48:47.929266    7476 device_windows.go:116] Attempting to create HostComputeNetwork &{ flannel.4096 Overlay [] {[]} { [] [] []} [{Static [{10.244.2.0/24 [[123 34 84 121 112 101 34 58 34 86 83 73 68 34 44 34 83 101 116 116 105 110 103 115 34 58 123 34 73 115 111 108 97 116 105 111 110 73 100 34 58 52 48 57 54 125 125]] [{10.244.2.1 0.0.0.0/0 0}]}]}] 8 {2 0}}
I0102 12:48:48.022269    7476 device_windows.go:124] Waiting to get ManagementIP from HostComputeNetwork flannel.4096
I0102 12:48:48.523266    7476 device_windows.go:136] Waiting to get net interface for HostComputeNetwork flannel.4096 (192.168.191.174)
I0102 12:48:49.114752    7476 device_windows.go:145] Created HostComputeNetwork flannel.4096
I0102 12:48:49.123752    7476 main.go:313] Changing default FORWARD chain policy to ACCEPT
I0102 12:48:49.123752    7476 main.go:321] Wrote subnet file to /run/flannel/subnet.env
I0102 12:48:49.123752    7476 main.go:325] Running backend.
I0102 12:48:49.123752    7476 main.go:343] Waiting for all goroutines to exit
I0102 12:48:49.123752    7476 vxlan_network_windows.go:63] Watching for new subnet leases
E0111 12:42:08.174287    7476 vxlan_network_windows.go:100] error decoding subnet lease JSON: invalid MAC address
E0118 07:27:38.361084    7476 vxlan_network_windows.go:100] error decoding subnet lease JSON: invalid MAC address
E0119 01:48:13.671057    7476 vxlan_network_windows.go:100] error decoding subnet lease JSON: invalid MAC address

@vitaliy-leschenko

When we set up flannel as host-gw, we can see Subnet added: 10.244.1.0/24 via 192.168.99.110 in the logs.
But when we set up flannel as vxlan, we always see error decoding subnet lease JSON: invalid MAC address.
Usually this does not affect connectivity.

I think we have an issue with kube-proxy, because it acts as the service load balancer.
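
One thing that could be checked for the invalid MAC address error (as far as I understand the flannel kube subnet manager, the vxlan lease data, including the VTEP MAC, is published as node annotations) is whether every node actually advertises that data:

# List the flannel annotations on each node; a node with missing or empty backend-data
# would be a candidate source of the "invalid MAC address" decode errors.
kubectl get nodes -o name | ForEach-Object {
    "$_"
    kubectl describe $_ | Select-String "flannel.alpha.coreos.com"
}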

@daschott
Author

Windows pods can communicate with a service only when the request lands on a backend pod on the same node.

This sounds to me like a pod-to-pod issue across nodes. The error in the FlannelD logs also sounds like there could be a misconfiguration on a different node - do you still have any nodes attached to the cluster that are configured for host-gw, or that have some error in their FlannelD configuration?

@vitaliy-leschenko

I have no nodes in host-gw mode; I used it only for tests when trying to reproduce the issue.
Usually I use vxlan mode because it works better on my hardware. However, when I upgrade my VMs to a build newer than 10.0.17763.1294, I run into this issue.
I don't have any errors in the flannel or kube-proxy logs. Everything looks OK, but it doesn't work.

@ghost

ghost commented Mar 19, 2021

This issue has been open for 30 days with no updates.
@daschott, please provide an update or close this issue.

@daschott
Author

Sorry for the long delay. In theory, DNS relies on service connectivity, so it is surprising to see these two statements together:

Windows pods can communicate with services only if the backing pods are hosted on the same node.
Windows pods have access to k8s DNS.

There is a relevant fix that came out in February. Can you try updating to the latest version, and then provide the following from the problematic node:

  1. CollectLogs.ps1 again
  2. The IP address of the source POD
  3. The IP address of the destination service
  4. The kube-proxy logs
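
A rough starting point for items 2-4, assuming the pod and service are the ones from the sample.yaml namespace above and that kube-proxy runs as a pod in kube-system (both assumptions):

kubectl -n issues get pods -o wide        # source pod IPs and the nodes they run on
kubectl -n issues get svc iis             # destination service cluster IP and ports
kubectl -n kube-system get pods -o wide   # locate the kube-proxy pod on the problematic Windows node, then:
# kubectl -n kube-system logs <kube-proxy-pod-on-that-node>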

@vrapolinario
Contributor

Any updates on this thread? Otherwise we'll go ahead and close it.
