Disable TX checksum offload for Flannel VXLAN #9074

Merged
33 changes: 33 additions & 0 deletions nodeup/pkg/model/network.go
@@ -20,6 +20,8 @@ import (
	"fmt"
	"path/filepath"

	"k8s.io/klog"
	"k8s.io/kops/pkg/systemd"
	"k8s.io/kops/upup/pkg/fi"
	"k8s.io/kops/upup/pkg/fi/nodeup/nodetasks"
)
@@ -66,6 +68,13 @@ func (b *NetworkBuilder) Build(c *fi.ModelBuilderContext) error {
		}
	}

	// TX checksum offloading is buggy for NAT-ed VXLAN endpoints: an invalid checksum is sent, the receiver
	// discards the traffic, and Flannel stops working.
	// https://github.com/coreos/flannel/issues/1279
	if networking != nil && (networking.Canal != nil || (networking.Flannel != nil && networking.Flannel.Backend == "vxlan")) {
		c.AddTask(b.buildFlannelTxChecksumOffloadDisableService())
	}

	return nil
}

@@ -88,3 +97,27 @@ func (b *NetworkBuilder) addCNIBinAsset(c *fi.ModelBuilderContext, assetName str

	return nil
}

func (b *NetworkBuilder) buildFlannelTxChecksumOffloadDisableService() *nodetasks.Service {
	const serviceName = "flannel-tx-checksum-offload-disable.service"

	manifest := &systemd.Manifest{}
	manifest.Set("Unit", "Description", "Disable TX checksum offload on flannel.1")
Member

Does Canal also use the name flannel.1?

Member Author

Yes, I tested this with Canal.

Contributor

@joshbranham, May 6, 2020

Yeah, we run Canal and flannel.1 is the interface name:

ubuntu@ip-10-129-0-88:~$ ifconfig | grep flannel
flannel.1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 8951

Contributor

@vvbogdanov87, Jun 11, 2020

Not in my setup. By the time flannel-tx-checksum-offload-disable is started, the flannel device is not ready yet:

admin@ip-10-63-80-157:~$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group
...
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc mq state UP group 
...
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state 
...

This change breaks my cluster. The flannel-tx-checksum-offload-disable service returns an error that prevents kops-configuration from finishing its job:

admin@ip-10-63-80-157:~$ sudo /sbin/ethtool -K flannel.1 tx-checksum-ip-generic off
Cannot get device feature names: No such device

I'm running on AWS using Canal:

  networking:
    canal: {}

I can't find sys-devices-virtual-net-flannel.1.device.

Member Author

How does this block kops-configuration from finishing the job? In the worst case, it stays as the last job.

Contributor

Sorry, my bad. It looks like the flannel pod never started because the node wasn't able to join the cluster. As a result, the device wasn't created and the kops-configuration service log was spammed with messages from the flannel-tx-checksum-offload-disable service. I had to manually fix the issue with the etcd manager certificate first (my cert had already expired and the rolling upgrade to the latest kops 1.17.0 failed).


	manifest.Set("Unit", "After", "sys-devices-virtual-net-flannel.1.device")
	manifest.Set("Install", "WantedBy", "sys-devices-virtual-net-flannel.1.device")
	manifest.Set("Service", "Type", "oneshot")
	manifest.Set("Service", "ExecStart", "/sbin/ethtool -K flannel.1 tx-checksum-ip-generic off")

	manifestString := manifest.Render()
	klog.V(8).Infof("Built service manifest %q\n%s", serviceName, manifestString)

	service := &nodetasks.Service{
		Name:       serviceName,
		Definition: s(manifestString),
	}

	service.InitDefaults()

	return service
}