Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switching AWS_VPC_CNI_NODE_PORT_SUPPORT to false breaks SNAT for Pods #2025

Closed
antoninbas opened this issue Jul 8, 2022 · 6 comments · Fixed by #2026
Closed

Switching AWS_VPC_CNI_NODE_PORT_SUPPORT to false breaks SNAT for Pods #2025

antoninbas opened this issue Jul 8, 2022 · 6 comments · Fixed by #2026
Assignees
Labels

Comments

@antoninbas
Copy link
Contributor

antoninbas commented Jul 8, 2022

What happened:
When setting AWS_VPC_CNI_NODE_PORT_SUPPORT to false for aws-node, SNAT breaks for all Pods for which the IP is assigned from a secondary network interface / ENI.

How to reproduce it (as minimally and precisely as possible):

  1. Create a default EKS cluster with a single Node, and use a "small-ish" instance type (so that the primary ENI can support a smaller number of addresses). For example: eksctl create cluster -N 1 --ssh-access --node-type t3.medium.
  2. Edit the aws-node Deployment to set AWS_VPC_CNI_NODE_PORT_SUPPORT to false. AWS_VPC_K8S_CNI_EXTERNALSNAT should remain set to false, so that the CNI is responsible for SNATting Pod traffic.
  3. Create a Deployment with 5 replicas. There is a single Node so all 5 Pods will run on that Node. For example, kubectl apply the following YAML:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: toolbox
spec:
  replicas: 5
  selector:
    matchLabels:
      app: toolbox
  template:
    metadata:
      labels:
        app: toolbox
    spec:
      containers:
      - name: toolbox
        image: antoninbas/toolbox
        command: ["sleep", "3600"]
  1. check which Pods are assigned to a secondary ENI. I do this by running ip rule list and matching the rules for route table to to Pod IPs.
  2. exec into one of these Pods and ping 8.8.8.8. It will fail.

Anything else we need to know?:

This is caused by some incorrect iptables rules:

  • iptableRules = append(iptableRules, iptablesRule{
    name: "connmark to fwmark copy",
    shouldExist: !n.useExternalSNAT,
    table: "nat",
    chain: "PREROUTING",
    rule: []string{
    "-m", "comment", "--comment", "AWS, CONNMARK", "-j", "CONNMARK",
    "--restore-mark", "--mask", fmt.Sprintf("%#x", n.mainENIMark),
    },
    })
    is in the nat table and hence only applies to the first packet of the connection
  • iptableRules = append(iptableRules, iptablesRule{
    name: "connmark restore for primary ENI",
    shouldExist: n.nodePortSupportEnabled,
    table: "mangle",
    chain: "PREROUTING",
    rule: []string{
    "-m", "comment", "--comment", "AWS, primary ENI",
    "-i", n.vethPrefix + "+", "-j", "CONNMARK", "--restore-mark", "--mask", fmt.Sprintf("%#x", n.mainENIMark),
    },
    })
    should be installed if AWS_VPC_CNI_NODE_PORT_SUPPORT is true OR if AWS_VPC_K8S_CNI_EXTERNALSNAT false (default).

Additionally, mainENIRule IP rule also needs to be installed, and loose rpf_filter needs to be configured.

I will submit a PR with these changes.

Environment:
eksctl version

0.105.0-rc.0

K8s version

Server Version: version.Info{Major:"1", Minor:"22+", GitVersion:"v1.22.10-eks-84b4fe6", GitCommit:"cc6a1b4915a99f49f5510ef0667f94b9ca832a8a", GitTreeState:"clean", BuildDate:"2022-06-09T18:24:04Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}

credit also goes to my colleague @tnqn for investigating this

@antoninbas antoninbas added the bug label Jul 8, 2022
antoninbas added a commit to antoninbas/amazon-vpc-cni-k8s that referenced this issue Jul 8, 2022
* the rules that need to be installed for NodePort support and SNAT
  support are very similar. The same traffic mark is needed for both. As
  a result, rules that are currently installed only when NodePort
  support is enabled should also be installed when external SNAT is
  disabled, which is the case by default.
* remove a useless iptables rule in the nat table (for restoring the
  traffic mark); the rule would only apply to the first packet of each
  connection. The mangle table should be in charge of restoring the
  traffic mark.
* remove "-m state --state NEW" from a rule in the nat table. This is
  always true for packets that traverse the nat table.
* fix typo in one rule's name (extra whitespace).

Fixes aws#2025

Co-authored-by: Quan Tian <qtian@vmware.com>

Signed-off-by: Antonin Bas <abas@vmware.com>
antoninbas added a commit to antoninbas/amazon-vpc-cni-k8s that referenced this issue Jul 8, 2022
* the rules that need to be installed for NodePort support and SNAT
  support are very similar. The same traffic mark is needed for both. As
  a result, rules that are currently installed only when NodePort
  support is enabled should also be installed when external SNAT is
  disabled, which is the case by default.
* remove "-m state --state NEW" from a rule in the nat table. This is
  always true for packets that traverse the nat table.
* fix typo in one rule's name (extra whitespace).

Fixes aws#2025

Co-authored-by: Quan Tian <qtian@vmware.com>

Signed-off-by: Antonin Bas <abas@vmware.com>
@jayanthvn
Copy link
Contributor

Hi,

Thanks for opening the PR for this issue. I assume you are on 1.10.x since k8s version is 1.22 right? Have you tried a similar test on 1.7 branch?

antoninbas added a commit to antoninbas/amazon-vpc-cni-k8s that referenced this issue Jul 12, 2022
* the rules that need to be installed for NodePort support and SNAT
  support are very similar. The same traffic mark is needed for both. As
  a result, rules that are currently installed only when NodePort
  support is enabled should also be installed when external SNAT is
  disabled, which is the case by default.
* remove "-m state --state NEW" from a rule in the nat table. This is
  always true for packets that traverse the nat table.
* fix typo in one rule's name (extra whitespace).

Fixes aws#2025

Co-authored-by: Quan Tian <qtian@vmware.com>

Signed-off-by: Antonin Bas <abas@vmware.com>
@antoninbas
Copy link
Contributor Author

@jayanthvn I am using v1.10.1-eksbuild.1. I can test with 1.7. Do you expect a different result?

BTW, I have updated my PR with a unit test (also fixed test failures for existing unit test cases which were introduced by my change).

@antoninbas
Copy link
Contributor Author

@jayanthvn I just finished testing with v1.7.10. I can confirm that the problem does not exist with the 1.7 version.

The IP rules are different for that older version:

0:	from all lookup local
512:	from all to 192.168.84.247 lookup main
512:	from all to 192.168.83.153 lookup main
512:	from all to 192.168.89.109 lookup main
512:	from all to 192.168.92.230 lookup main
512:	from all to 192.168.72.42 lookup main
512:	from all to 192.168.88.212 lookup main
512:	from all to 192.168.83.69 lookup main
1536:	from 192.168.84.247 to 192.168.0.0/16 lookup 3
1536:	from 192.168.83.153 to 192.168.0.0/16 lookup 3
1536:	from 192.168.89.109 to 192.168.0.0/16 lookup 2
1536:	from 192.168.92.230 to 192.168.0.0/16 lookup 3
1536:	from 192.168.72.42 to 192.168.0.0/16 lookup 3
1536:	from 192.168.88.212 to 192.168.0.0/16 lookup 3
32766:	from all lookup main
32767:	from all lookup default

The rules which select a secondary route table (for Pods assigned to secondary ENIs) have an extra to clause compared to 1.10. For example:

1536:	from 192.168.84.247 to 192.168.0.0/16 lookup 3

Which means that SNATed traffic still hits the main route table.

With 1.10, the rules are as follows:

0:	from all lookup local
512:	from all to 192.168.29.56 lookup main
512:	from all to 192.168.24.134 lookup main
512:	from all to 192.168.31.135 lookup main
512:	from all to 192.168.31.223 lookup main
512:	from all to 192.168.29.27 lookup main
512:	from all to 192.168.16.158 lookup main
512:	from all to 192.168.2.135 lookup main
1024:	from all fwmark 0x80/0x80 lookup main
1536:	from 192.168.31.223 lookup 2
1536:	from 192.168.29.27 lookup 2
1536:	from 192.168.16.158 lookup 2
1536:	from 192.168.2.135 lookup 2
32766:	from all lookup main
32767:	from all lookup default

So SNATed traffic needs to have the 0x80 mark and the bypass rule (1024: from all fwmark 0x80/0x80 lookup main) is required. Otherwise it goes to secondary route tables, and it egresses through a secondary ENI (but with the source IP set to the primary IP).

antoninbas added a commit to antoninbas/amazon-vpc-cni-k8s that referenced this issue Jul 13, 2022
* the rules that need to be installed for NodePort support and SNAT
  support are very similar. The same traffic mark is needed for both. As
  a result, rules that are currently installed only when NodePort
  support is enabled should also be installed when external SNAT is
  disabled, which is the case by default.
* remove "-m state --state NEW" from a rule in the nat table. This is
  always true for packets that traverse the nat table.
* fix typo in one rule's name (extra whitespace).

Fixes aws#2025

Co-authored-by: Quan Tian <qtian@vmware.com>

Signed-off-by: Antonin Bas <abas@vmware.com>
antoninbas added a commit to antoninbas/amazon-vpc-cni-k8s that referenced this issue Aug 19, 2022
* the rules that need to be installed for NodePort support and SNAT
  support are very similar. The same traffic mark is needed for both. As
  a result, rules that are currently installed only when NodePort
  support is enabled should also be installed when external SNAT is
  disabled, which is the case by default.
* remove "-m state --state NEW" from a rule in the nat table. This is
  always true for packets that traverse the nat table.
* fix typo in one rule's name (extra whitespace).

Fixes aws#2025

Co-authored-by: Quan Tian <qtian@vmware.com>

Signed-off-by: Antonin Bas <abas@vmware.com>
@github-actions
Copy link

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

@github-actions github-actions bot added the stale Issue or PR is stale label Sep 25, 2022
@antoninbas
Copy link
Contributor Author

/remove stale

@github-actions github-actions bot removed the stale Issue or PR is stale label Sep 27, 2022
sushrk added a commit that referenced this issue Nov 1, 2022
* Add missing rules when NodePort support is disabled

* the rules that need to be installed for NodePort support and SNAT
  support are very similar. The same traffic mark is needed for both. As
  a result, rules that are currently installed only when NodePort
  support is enabled should also be installed when external SNAT is
  disabled, which is the case by default.
* remove "-m state --state NEW" from a rule in the nat table. This is
  always true for packets that traverse the nat table.
* fix typo in one rule's name (extra whitespace).

Fixes #2025

Co-authored-by: Quan Tian <qtian@vmware.com>

Signed-off-by: Antonin Bas <abas@vmware.com>

* Fix typos and unit tests

Signed-off-by: Antonin Bas <abas@vmware.com>

* Minor improvement to code comment

Signed-off-by: Antonin Bas <abas@vmware.com>

* Address review comments

* Delete legacy nat rule
* Fix an unrelated log message

Signed-off-by: Antonin Bas <abas@vmware.com>

Signed-off-by: Antonin Bas <abas@vmware.com>
Co-authored-by: Jayanth Varavani <1111446+jayanthvn@users.noreply.github.com>
Co-authored-by: Sushmitha Ravikumar <58063229+sushrk@users.noreply.github.com>
@github-actions
Copy link

github-actions bot commented Nov 1, 2022

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.

jdn5126 added a commit that referenced this issue Dec 12, 2022
* create publisher with logger (#2119)

* Add missing rules when NodePort support is disabled (#2026)

* Add missing rules when NodePort support is disabled

* the rules that need to be installed for NodePort support and SNAT
  support are very similar. The same traffic mark is needed for both. As
  a result, rules that are currently installed only when NodePort
  support is enabled should also be installed when external SNAT is
  disabled, which is the case by default.
* remove "-m state --state NEW" from a rule in the nat table. This is
  always true for packets that traverse the nat table.
* fix typo in one rule's name (extra whitespace).

Fixes #2025

Co-authored-by: Quan Tian <qtian@vmware.com>

Signed-off-by: Antonin Bas <abas@vmware.com>

* Fix typos and unit tests

Signed-off-by: Antonin Bas <abas@vmware.com>

* Minor improvement to code comment

Signed-off-by: Antonin Bas <abas@vmware.com>

* Address review comments

* Delete legacy nat rule
* Fix an unrelated log message

Signed-off-by: Antonin Bas <abas@vmware.com>

Signed-off-by: Antonin Bas <abas@vmware.com>
Co-authored-by: Jayanth Varavani <1111446+jayanthvn@users.noreply.github.com>
Co-authored-by: Sushmitha Ravikumar <58063229+sushrk@users.noreply.github.com>

* downgrade test go.mod to align with root go.mod (#2128)

* skip addon installation when addon info is not available (#2131)

* Merging test/Makefile and test/go.mod to the root Makefil and go.mod, adjust the .github/workflows and integration test instructions (#2129)

* update troubleshooting docs for CNI image (#2132)

fix location where make command is run

* fix env name in test script (#2136)

* optionally allow CLUSTER_ENDPOINT to be used rather than the cluster-ip (#2138)

* optionally allow CLUSTER_ENDPOINT to be used rather than the kubernetes cluster ip

* remove check for kube-proxy

* add version to readme

* Add resources config option to cni metrics helper (#2141)

* Add resources config option to cni metrics helper

* Remove default-empty resources block; replace with conditional

* Add metrics for ec2 api calls made by CNI and expose via prometheus (#2142)

Co-authored-by: Jay Deokar <jsdeokar@amazon.com>

* increase workflow role duration to 4 hours (#2148)

* Update golang 1.19.2 EKS-D (#2147)

* Update golang

* Move to EKS distro builds

* [HELM]: Move CRD resources to a separate folder as per helm standard (#2144)

Co-authored-by: Jay Deokar <jsdeokar@amazon.com>

* VPC-CNI minimal image builds (#2146)

* VPC-CNI minimal image builds

* update dependencies for ginkgo when running integration tests

* address review comments and break up init main function

* review comments for sysctl

* Simplify binary installation, fix review comments

Since init container is required to always run, let binary installation
for external plugins happen in init container. This simplifies the main
container entrypoint and the dockerfile for each image.

* when IPAMD connection fails, try to teardown pod network using prevResult (#2145)

* add env var to enable nftables (#2155)

* fix failing weekly cron tests (#2154)

* Deprecate AWS_VPC_K8S_CNI_CONFIGURE_RPFILTER and remove no-op setter (#2153)

* Deprecate AWS_VPC_K8S_CNI_CONFIGURE_RPFILTER

* update release version comments

Signed-off-by: Antonin Bas <abas@vmware.com>
Co-authored-by: Jeffrey Nelson <jdnelson@amazon.com>
Co-authored-by: Antonin Bas <antonin.bas@gmail.com>
Co-authored-by: Jayanth Varavani <1111446+jayanthvn@users.noreply.github.com>
Co-authored-by: Sushmitha Ravikumar <58063229+sushrk@users.noreply.github.com>
Co-authored-by: Jerry He <37866862+jerryhe1999@users.noreply.github.com>
Co-authored-by: Brandon Wagner <wagnerbm@amazon.com>
Co-authored-by: Jonathan Ogilvie <679297+jcogilvie@users.noreply.github.com>
Co-authored-by: Jay Deokar <jsdeokar@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants