[CNI]: Teardown pod network when IPAMD connection fails #2145
Merged
jdn5126 force-pushed the ip_rule_leak_fallback branch from 6bacc9b to 968645c (November 29, 2022 23:33)
M00nF1sh previously approved these changes (Dec 8, 2022)
/lgtm
jdn5126 force-pushed the ip_rule_leak_fallback branch from 968645c to b46a510 (December 8, 2022 18:03)
jdn5126 force-pushed the ip_rule_leak_fallback branch from b46a510 to f452396 (December 8, 2022 18:05)
M00nF1sh approved these changes (Dec 8, 2022)
jdn5126 added a commit that referenced this pull request (Dec 12, 2022)
* create publisher with logger (#2119)
* Add missing rules when NodePort support is disabled (#2026)
* Add missing rules when NodePort support is disabled
* the rules that need to be installed for NodePort support and SNAT support are very similar. The same traffic mark is needed for both. As a result, rules that are currently installed only when NodePort support is enabled should also be installed when external SNAT is disabled, which is the case by default.
* remove "-m state --state NEW" from a rule in the nat table. This is always true for packets that traverse the nat table.
* fix typo in one rule's name (extra whitespace). Fixes #2025 Co-authored-by: Quan Tian <qtian@vmware.com> Signed-off-by: Antonin Bas <abas@vmware.com>
* Fix typos and unit tests Signed-off-by: Antonin Bas <abas@vmware.com>
* Minor improvement to code comment Signed-off-by: Antonin Bas <abas@vmware.com>
* Address review comments
* Delete legacy nat rule
* Fix an unrelated log message Signed-off-by: Antonin Bas <abas@vmware.com> Signed-off-by: Antonin Bas <abas@vmware.com> Co-authored-by: Jayanth Varavani <1111446+jayanthvn@users.noreply.github.com> Co-authored-by: Sushmitha Ravikumar <58063229+sushrk@users.noreply.github.com>
* downgrade test go.mod to align with root go.mod (#2128)
* skip addon installation when addon info is not available (#2131)
* Merging test/Makefile and test/go.mod to the root Makefil and go.mod, adjust the .github/workflows and integration test instructions (#2129)
* update troubleshooting docs for CNI image (#2132) fix location where make command is run
* fix env name in test script (#2136)
* optionally allow CLUSTER_ENDPOINT to be used rather than the cluster-ip (#2138)
* optionally allow CLUSTER_ENDPOINT to be used rather than the kubernetes cluster ip
* remove check for kube-proxy
* add version to readme
* Add resources config option to cni metrics helper (#2141)
* Add resources config option to cni metrics helper
* Remove default-empty resources block; replace with conditional
* Add metrics for ec2 api calls made by CNI and expose via prometheus (#2142) Co-authored-by: Jay Deokar <jsdeokar@amazon.com>
* increase workflow role duration to 4 hours (#2148)
* Update golang 1.19.2 EKS-D (#2147)
* Update golang
* Move to EKS distro builds
* [HELM]: Move CRD resources to a separate folder as per helm standard (#2144) Co-authored-by: Jay Deokar <jsdeokar@amazon.com>
* VPC-CNI minimal image builds (#2146)
* VPC-CNI minimal image builds
* update dependencies for ginkgo when running integration tests
* address review comments and break up init main function
* review comments for sysctl
* Simplify binary installation, fix review comments Since init container is required to always run, let binary installation for external plugins happen in init container. This simplifies the main container entrypoint and the dockerfile for each image.
* when IPAMD connection fails, try to teardown pod network using prevResult (#2145)
* add env var to enable nftables (#2155)
* fix failing weekly cron tests (#2154)
* Deprecate AWS_VPC_K8S_CNI_CONFIGURE_RPFILTER and remove no-op setter (#2153)
* Deprecate AWS_VPC_K8S_CNI_CONFIGURE_RPFILTER
* update release version comments Signed-off-by: Antonin Bas <abas@vmware.com> Co-authored-by: Jeffrey Nelson <jdnelson@amazon.com> Co-authored-by: Antonin Bas <antonin.bas@gmail.com> Co-authored-by: Jayanth Varavani <1111446+jayanthvn@users.noreply.github.com> Co-authored-by: Sushmitha Ravikumar <58063229+sushrk@users.noreply.github.com> Co-authored-by: Jerry He <37866862+jerryhe1999@users.noreply.github.com> Co-authored-by: Brandon Wagner <wagnerbm@amazon.com> Co-authored-by: Jonathan Ogilvie <679297+jcogilvie@users.noreply.github.com> Co-authored-by: Jay Deokar <jsdeokar@amazon.com>
haouc pushed a commit to haouc/amazon-vpc-cni-k8s that referenced this pull request (Dec 13, 2022)
What type of PR is this?
bug
Which issue does this PR fix:
#2048
What does this PR do / Why do we need it:
Note that this PR replaces #2125
This PR resolves an issue in which IP rules were leaked by the CNI. When processing a pod deletion, the CNI would wait for a response from IPAMD before tearing down pod networking resources. If IPAMD could not be reached, the CNI would return an error and wait for kubelet to retry the delete. If IPAMD were restarted, the state for this pod would be cleared without the CNI tearing down the associated networking resources. The trigger for the linked issue was the k8s cluster autoscaler evicting the aws-node daemonset pod before other pods and then later cancelling the pod evictions. kubernetes/autoscaler#5240 was filed to ask for the k8s cluster autoscaler to change its behavior.
The changes in this PR are two-fold:
There is a lot of duplication between teardownPodNetworkWithPrevResult and tryDelWithPrevResult. I kept them separate to avoid unnecessarily complicating tryDelWithPrevResult and to make it clear that teardownPodNetworkWithPrevResult is a fallback mechanism.
If an issue # is not available please add repro steps and logs from IPAMD/CNI showing the issue:
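For illustration, the fallback behavior described in this PR can be sketched in Go as follows. This is a minimal sketch under stated assumptions, not the PR's actual code: deletePod, delNetworkFunc, teardownFunc, and errIPAMDUnreachable are hypothetical names, and the previous CNI result is reduced to a single pod IP string. Returning nil after a successful fallback teardown is also an assumption of this sketch.

```go
package main

import (
	"errors"
	"fmt"
)

// errIPAMDUnreachable is a hypothetical sentinel for a failed IPAMD connection.
var errIPAMDUnreachable = errors.New("IPAMD unreachable")

// delNetworkFunc is a hypothetical stand-in for the CNI's delete request to
// IPAMD. It returns the pod IP that IPAMD released for the container.
type delNetworkFunc func(containerID string) (podIP string, err error)

// teardownFunc is a hypothetical stand-in for removing the host-side IP rules
// and routes associated with a pod IP.
type teardownFunc func(podIP string) error

// deletePod handles a pod deletion. prevPodIP is the pod IP recovered from the
// previous CNI result (prevResult); it is empty if none is available.
func deletePod(containerID, prevPodIP string, delNetwork delNetworkFunc, teardown teardownFunc) error {
	podIP, err := delNetwork(containerID)
	if err == nil {
		// Normal path: IPAMD answered, so tear down using its response.
		return teardown(podIP)
	}
	if prevPodIP == "" {
		// Nothing to fall back on: surface the error so the caller can retry.
		return fmt.Errorf("delete %s: %w", containerID, err)
	}
	// Fallback path: IPAMD is unreachable, so clean up from prevResult
	// instead of leaving IP rules behind.
	if tdErr := teardown(prevPodIP); tdErr != nil {
		return fmt.Errorf("fallback teardown for %s: %w", containerID, tdErr)
	}
	return nil
}
```

Keeping the fallback as its own branch mirrors the intent stated above: the normal delete path stays simple, and the prevResult-based teardown is clearly a last resort.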
Testing done on this change:
Added more test cases to cni_test.go and verified that all CNI and IPAMD integration tests pass with this change. Also manually verified the fix with IPv4 and IPv6 clusters.
Automation added to e2e:
N/A
Will this PR introduce any new dependencies?:
No
Will this break upgrades or downgrades. Has updating a running cluster been tested?:
This will not break upgrades or downgrades. Updating a running cluster has been tested.
Does this change require updates to the CNI daemonset config files to work?:
No
Does this PR introduce any user-facing change?:
No
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.