
Pod to service connectivity issues on August and September cumulative updates on Windows Server 2019 #61

Closed
daschott opened this issue Oct 6, 2020 · 23 comments

@daschott

daschott commented Oct 6, 2020

In Kubernetes on Windows Server, the DECAP_IN VFP layer gets dropped on the 8C and 9B cumulative updates for Windows Server 2019 when the HNS service is restarted. This can cause pod -> service traffic to fail in some cases and configurations.

If a Windows Server 2019 machine needs to be restarted, a workaround to try is:

pause kube-proxy 
Get-HNSPolicyList | Remove-HNSPolicyList
<restart>

This regression will be resolved in the cumulative update being released in the third week of October. This issue does not surface on Windows Server, version 1903 and above.
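
A minimal PowerShell sketch of the workaround above, assuming kube-proxy runs as a Windows service named kube-proxy and that the HNS helper module (hns.psm1 from microsoft/SDN) is available - both are assumptions, so adjust for your setup:

Import-Module C:\k\hns.psm1                  # hypothetical path to the HNS helper module
Stop-Service kube-proxy                      # "pause" kube-proxy so it cannot re-create policies
Get-HNSPolicyList | Remove-HNSPolicyList     # clear the stale HNS (load balancer) policy lists
Restart-Computer                             # after the reboot, kube-proxy re-creates the policies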

@ghost ghost added the triage (New and needs attention) label Oct 6, 2020
@vrapolinario vrapolinario added the bug (Something isn't working) and Networking (Connectivity and network infrastructure) labels and removed the triage (New and needs attention) label Oct 6, 2020
@daschott
Author

The ETA for the fix is next week. Impacted users who need a (test-signed!) fix ahead of next week should reach out to their Microsoft customer support contact to request one.

@llyons

llyons commented Oct 13, 2020

When you say

pause kube-proxy

what do you mean there? pause just puts up a prompt to press any key to continue.

Do you mean reboot the system?

@daschott
Author

What I mean is to stop kube-proxy, so that no HNS policies get re-created by an actively running kube-proxy after you run Get-HNSPolicyList | Remove-HNSPolicyList.
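
For example, on a node where kube-proxy is registered as a Windows service (an assumption - adjust if you run it under nssm or as a plain process):

Stop-Service kube-proxy                       # or: Stop-Process -Name kube-proxy -Force
Get-HNSPolicyList | Remove-HNSPolicyList      # requires the hns.psm1 helper module to be imported
(Get-HNSPolicyList | Measure-Object).Count    # should stay at 0 while kube-proxy is stopped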

@vitaliy-leschenko

KB4577668 has been released on October 13, 2020 (OS Build 17763.1518).
Does it fix the issue?

@daschott
Author

No, that build still has the issue. The next KB should have the fix; the ETA is October 20th.

@vitaliy-leschenko

Hi again,
KB4580390 has been released on October 20, 2020 (OS Build 17763.1554).
Does it fix the issue?

On the page (https://support.microsoft.com/en-us/help/4580390/windows-10-update-kb4580390) it is marked as Preview. Is it the final KB, or do we need to wait for another one?

@daschott
Author

@vitaliy-leschenko

For the "preview" naming please see here:
https://techcommunity.microsoft.com/t5/windows-it-pro-blog/resuming-optional-windows-10-and-windows-server-non-security/ba-p/1471429

It is the final production KB, but it is optional, meaning users need to seek it out. It will also be included in next month's "B" release update, hence the name "preview"...

This KB contains the fix for this issue. Can you try it out and confirm whether the issue still reproduces or not?

@vitaliy-leschenko

vitaliy-leschenko commented Oct 22, 2020

I confirm that my nodes that were updated to 10.0.17763.1554 no longer have the pod-to-service connectivity issue.
All my nodes were updated manually from 10.0.17763.1294 by installing the MSU (http://download.windowsupdate.com/c/msdownload/update/software/updt/2020/10/windows10.0-kb4580390-x64_743bc31f33bf399c7f15ab020df685780faf4cb5.msu).
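
In case it helps anyone else verify the patch level, a quick check on each node is something like:

# Shows the build and update revision, e.g. 17763 and 1554.
Get-ItemProperty "HKLM:\SOFTWARE\Microsoft\Windows NT\CurrentVersion" | Select-Object CurrentBuild, UBR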

@vitaliy-leschenko

vitaliy-leschenko commented Nov 27, 2020

I am sorry about my previous comment, but the issue does still exist.
Windows pods can communicate with a service only when the request lands on a backend pod on the same node.
When it is load-balanced to another node, it fails.

More details: kubernetes-sigs/sig-windows-tools#127

@vitaliy-leschenko

@immuzz @daschott can you reopen the issue?

@vitaliy-leschenko

Build 17763.1577 also doesn't work.

@daschott daschott reopened this Dec 5, 2020
@daschott
Author

daschott commented Dec 5, 2020

@vitaliy-leschenko Can you please try out some of the troubleshooting steps here, and give us the output of the collectlogs.ps1 script:
https://techcommunity.microsoft.com/t5/networking-blog/troubleshooting-kubernetes-networking-on-windows-part-1/ba-p/508648

Specifically, can you reproduce the output shown in example #1 and example #3 in the above doc? Do all the Windows nodes have the same patch level, 17763.1577?
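
If it helps, one way to fetch and run the script from the microsoft/SDN repo is roughly:

Invoke-WebRequest -UseBasicParsing -OutFile collectlogs.ps1 `
    -Uri "https://raw.githubusercontent.com/microsoft/SDN/master/Kubernetes/windows/debug/collectlogs.ps1"
.\collectlogs.ps1    # dumps HNS/VFP/network state into a local folder that can be zipped and shared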

@vitaliy-leschenko

OK, I will try.

@vitaliy-leschenko

I tested Windows Server 1809 with the latest updates installed (version 10.0.17763.1637). It still has the pod-to-service connectivity issue.

curl -LO https://raw.githubusercontent.com/kubernetes-sigs/sig-windows-tools/master/hack/DebugWindowsNode.ps1
[screenshot of DebugWindowsNode.ps1 output omitted]

Can't test .\ValidateKubernetes.Pester.tests.ps1 because:

  • it is designed for an old k8s version
  • I use flannel in overlay mode instead of l2bridge

Currently we have this situation:

  1. Windows pods have internet access
  2. Windows pods have access to k8s DNS
  3. Windows pods can communicate with services only if the backing pods are hosted on the same node
  4. Windows pods can communicate with any pod by podIP

I created a sample to reproduce the issue: https://vitaliyorgstorage.azureedge.net/9d4196ab-1c3c-4efb-b065-b643384f1832/github/sample.yaml
It contains the namespace issues with 2 services: iis (for Windows nodes) and nginx (for Linux nodes).
Then I connected to the iis pods and tried:

  • curl -Ikv iis.issues.svc.cluster.local - sometimes OK (if the load balancer routes the request to the same node), otherwise it fails.
  • curl -Ikv nginx.issues.svc.cluster.local - always OK
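
As a rough way to see the intermittent behavior from inside the cluster (the app=iis label selector and curl.exe inside the container are assumptions based on the sample), repeating the request from one iis pod shows the per-node pattern:

# Pick one of the Windows pods and hit the service several times; with the issue
# present, only requests that land on a backend pod on the same node succeed.
$pod = kubectl -n issues get pods -l app=iis -o jsonpath='{.items[0].metadata.name}'
1..10 | ForEach-Object {
    kubectl -n issues exec $pod -- curl.exe -Iks --max-time 5 http://iis.issues.svc.cluster.local
}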

@vitaliy-leschenko

Logs for https://raw.githubusercontent.com/microsoft/SDN/master/Kubernetes/windows/debug/collectlogs.ps1 can be downloaded from: https://vitaliyorgstorage.azureedge.net/9d4196ab-1c3c-4efb-b065-b643384f1832/github/2ybmbpfj.354.zip

@daschott
Author

Sorry for the delay here.

What do the logs look like from the problematic node? Is FlannelD running on the problematic node? Also, what do the FlannelD logs look like on the problematic node? There should be entries such as

Subnet added:  10.244.1.0/24 via 192.168.99.110
...
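
For example, assuming flanneld's output is redirected to a file such as C:\flannel\flanneld.log (a hypothetical path - the exact location depends on how it is launched):

# Adjust the path to wherever flanneld's stdout/stderr is captured.
Select-String -Path C:\flannel\flanneld.log -Pattern "Subnet added"

Each remote node's pod subnet should show up with its own entry.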

@vitaliy-leschenko

FlannelD is running. Pods can communicate via IP addresses.

Here is a flannel log from another cluster with the same issue:

I0102 12:48:46.720161    7476 main.go:518] Determining IP address of default interface
I0102 12:48:46.897166    7476 main.go:531] Using interface with name vEthernet (Ethernet) 2 and address 192.168.191.174
I0102 12:48:46.897166    7476 main.go:548] Defaulting external address to interface address (192.168.191.174)
I0102 12:48:46.909163    7476 kube.go:119] Waiting 10m0s for node controller to sync
I0102 12:48:46.909163    7476 kube.go:306] Starting kube subnet manager
I0102 12:48:47.909266    7476 kube.go:126] Node controller sync successful
I0102 12:48:47.909266    7476 main.go:246] Created subnet manager: Kubernetes Subnet Manager - pq-node01
I0102 12:48:47.909266    7476 main.go:249] Installing signal handlers
I0102 12:48:47.909266    7476 main.go:390] Found network config - Backend type: vxlan
I0102 12:48:47.909266    7476 vxlan_windows.go:127] VXLAN config: Name=flannel.4096 MacPrefix=0E-2A VNI=4096 Port=4789 GBP=false DirectRouting=false
I0102 12:48:47.929266    7476 device_windows.go:116] Attempting to create HostComputeNetwork &{ flannel.4096 Overlay [] {[]} { [] [] []} [{Static [{10.244.2.0/24 [[123 34 84 121 112 101 34 58 34 86 83 73 68 34 44 34 83 101 116 116 105 110 103 115 34 58 123 34 73 115 111 108 97 116 105 111 110 73 100 34 58 52 48 57 54 125 125]] [{10.244.2.1 0.0.0.0/0 0}]}]}] 8 {2 0}}
I0102 12:48:48.022269    7476 device_windows.go:124] Waiting to get ManagementIP from HostComputeNetwork flannel.4096
I0102 12:48:48.523266    7476 device_windows.go:136] Waiting to get net interface for HostComputeNetwork flannel.4096 (192.168.191.174)
I0102 12:48:49.114752    7476 device_windows.go:145] Created HostComputeNetwork flannel.4096
I0102 12:48:49.123752    7476 main.go:313] Changing default FORWARD chain policy to ACCEPT
I0102 12:48:49.123752    7476 main.go:321] Wrote subnet file to /run/flannel/subnet.env
I0102 12:48:49.123752    7476 main.go:325] Running backend.
I0102 12:48:49.123752    7476 main.go:343] Waiting for all goroutines to exit
I0102 12:48:49.123752    7476 vxlan_network_windows.go:63] Watching for new subnet leases
E0111 12:42:08.174287    7476 vxlan_network_windows.go:100] error decoding subnet lease JSON: invalid MAC address
E0118 07:27:38.361084    7476 vxlan_network_windows.go:100] error decoding subnet lease JSON: invalid MAC address
E0119 01:48:13.671057    7476 vxlan_network_windows.go:100] error decoding subnet lease JSON: invalid MAC address

@vitaliy-leschenko

When we set up flannel as host-gw, we can see Subnet added: 10.244.1.0/24 via 192.168.99.110 in the logs.
But when we set up flannel as vxlan, we always see error decoding subnet lease JSON: invalid MAC address.
Usually this does not affect connectivity.

I think we have an issue with kube-proxy, because it acts as the service load balancer.
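
One thing that could be checked for the invalid MAC address error (as far as I understand the flannel kube subnet manager, the vxlan lease data, including the VTEP MAC, is published as node annotations) is whether every node actually advertises that data:

# List the flannel annotations on each node; a node with missing or empty backend-data
# would be a candidate source of the "invalid MAC address" decode errors.
kubectl get nodes -o name | ForEach-Object {
    "$_"
    kubectl describe $_ | Select-String "flannel.alpha.coreos.com"
}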

@daschott
Author

Windows pods can communicate with a service only when the request lands on a backend pod on the same node.

This sounds to me like a pod-to-pod issue across nodes. The error in the FlannelD logs also sounds like there could be a misconfiguration on a different node - do you still have any nodes attached to the cluster that are configured for host-gw, or that have some error in their FlannelD configuration?

@vitaliy-leschenko

I have no nodes in host-gw mode; I used it only for tests when trying to reproduce the issue.
Usually I use vxlan mode because it works better on my hardware. However, when I upgrade my VMs to a build newer than 10.0.17763.1294, I run into this issue.
I don't have any errors in the flannel or kube-proxy logs. Everything looks OK, but it doesn't work.

@ghost

ghost commented Mar 19, 2021

This issue has been open for 30 days with no updates.
@daschott, please provide an update or close this issue.

@daschott
Author

Sorry for the long delay. In theory, DNS relies on service connectivity, so it is surprising to see these two statements together:

Windows pods can communicate with services only if the backing pods are hosted on the same node.
Windows pods have access to k8s DNS.

There is a relevant fix that came out in February. Can you try updating to the latest version, and then provide the following from the problematic node:

  1. CollectLogs.ps1 again
  2. The IP address of the source POD
  3. The IP address of the destination service
  4. The kube-proxy logs
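
A rough starting point for items 2-4, assuming the pod and service are the ones from the sample.yaml namespace above and that kube-proxy runs as a pod in kube-system (both assumptions):

kubectl -n issues get pods -o wide        # source pod IPs and the nodes they run on
kubectl -n issues get svc iis             # destination service cluster IP and ports
kubectl -n kube-system get pods -o wide   # locate the kube-proxy pod on the problematic Windows node, then:
# kubectl -n kube-system logs <kube-proxy-pod-on-that-node>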

@vrapolinario
Contributor

Any updates on this thread? Otherwise we'll go ahead and close it.
