Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IPAMD optimizations and makefile changes #1975

Merged
merged 12 commits into from
Aug 15, 2022
Merged

Conversation

jayanthvn
Copy link
Contributor

@jayanthvn jayanthvn commented Apr 21, 2022

What type of PR is this?
bug

Which issue does this PR fix:
Fixes #1834
Fixes #1872

What does this PR do / Why do we need it:

  1. During reconciling or even startup, IPAMD calls tryAssignCidr and adds the IPs/Prefixes to datastore once EC2 allocates the IPs and doesn't wait for IMDS to sync. In scenarios where IMDS is out of sync we will end up tagging those pods as stale during restart. Hence similar to Alloc path, we need to wait until IMDS sync is done.

  2. While we wait for the IPs to be present in IMDS, we check if the requested CIDRs are available but this can be more than what the ENI supports and we end up waiting in the retry loop. Instead we need to request for min of {IPs/Prefixes needed, max supported on ENI}

  3. Make file grouping.

If an issue # is not available please add repro steps and logs from IPAMD/CNI showing the issue:
#1834 #1872

Testing done on this change:

make help

Usage:
  make <target>

Building
  all              Builds Init, CNI and metrics helper container images.
  build-linux      Build the VPC CNI plugin agent using the host's Go toolchain.
  docker           Build VPC CNI plugin & agent container image.
  docker-init      Build VPC CNI plugin Init container image.
  docker-func-test  Run the built CNI container image to use in functional testing
  multi-arch-cni-build-push  Build multi-arch VPC CNI container image.
  multi-arch-cni-init-build-push  Build VPC CNI plugin Init container image.

Run Unit Tests
  unit-test        Run unit tests
  unit-test-race   Run unit tests with race detection (can only be run natively)

Build and Run Unit Tests
  build-docker-test  Build the unit test driver container image.
  docker-unit-tests  Run unit tests inside of the testing container image.

Build metrics helper agent
  build-metrics    Build metrics helper agent.
  docker-metrics   Build metrics helper agent Docker image.

Run metrics helper Unit Tests
  metrics-unit-test  Run metrics helper unit test suite (must be run natively).
  docker-metrics-test  Run metrics helper unit test suite in a container.
  plugins          Fetch the CNI plugins

Debug script
  debug-script     Fetching debug script from awslabs/amazon-eks-ami

Formatting
  check            Run all source code checks.
  lint             Run golint on source code.
  vet              Run go vet on source code.
  docker-vet       Run go vet inside of a container.
  format           Format all Go source code files. (Note! integration_test.go has an upstream import dependency that doesn't match)

Generate ENI/IP limits
  generate-limits  Generate limit file go code

Cleanup
  clean            Clean temporary files and build artifacts from the project.

Helpers
  help             Display this help
PASS
coverage: 100.0% of statements
ok  	github.com/aws/amazon-vpc-cni-k8s/pkg/utils/retry	0.002s	coverage: 100.0% of statements
?   	github.com/aws/amazon-vpc-cni-k8s/pkg/utils/ttime	[no test files]
?   	github.com/aws/amazon-vpc-cni-k8s/pkg/utils/ttime/mocks	[no test files]
?   	github.com/aws/amazon-vpc-cni-k8s/pkg/version	[no test files]

Automation added to e2e:

Yes

Will this PR introduce any new dependencies?:

No, EC2 calls will reduce since reconciler on adding IPs would make an ec2 call now instead it will wait for IMDS sync.

Will this break upgrades or downgrades. Has updating a running cluster been tested?:
No

Does this change require updates to the CNI daemonset config files to work?:

No

Does this PR introduce any user-facing change?:

No


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@jayanthvn jayanthvn requested a review from a team as a code owner April 21, 2022 22:51
@jayanthvn jayanthvn requested a review from achevuru April 21, 2022 22:58
orsenthil
orsenthil previously approved these changes Apr 21, 2022
Copy link
Member

@orsenthil orsenthil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@jayanthvn
Copy link
Contributor Author

Thanks Senthil. I am running few UTs, will update it and the merge.

pkg/ipamd/ipamd.go Outdated Show resolved Hide resolved
@github-actions
Copy link

This pull request is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

@github-actions github-actions bot added the stale Issue or PR is stale label Jun 25, 2022
@jayanthvn
Copy link
Contributor Author

/not stale. Review pending

@github-actions github-actions bot removed the stale Issue or PR is stale label Jun 26, 2022
@jayanthvn
Copy link
Contributor Author

jayanthvn commented Jul 20, 2022

Ran a simple pod churn test -

1 node, replica 1 to 30 to 0 and repeated 10 runs. WARM IP = 4 and MIN IP = 10. Describe interface calls reduced quite significantly -

With 1.11.2 released image ->

Screen Shot 2022-07-20 at 1 36 38 PM

With the fix ->

Screen Shot 2022-07-20 at 1 51 13 PM

API Calls covering both the above timelines ->

Screen Shot 2022-07-20 at 1 52 01 PM

I will be running few more tests to slow down the pod churn and measure the performance. Also will be checking the pod launch time since now we would rely on IMDS sync on increasing IP pool.

@jayanthvn
Copy link
Contributor Author

Adding some delay between pod churns looks good too -

With 1.11.2

Screen Shot 2022-07-20 at 2 46 28 PM

With fix

Screen Shot 2022-07-20 at 4 12 59 PM

pkg/ipamd/ipamd.go Outdated Show resolved Hide resolved
@jayanthvn
Copy link
Contributor Author

This will introduce a bit of delay in the pod launch time during churn scenarios since we rely on IPs to be available in IMDS. So I will be splitting the PR, one with make file changes and reducing the describe call and another with just the dependency on IMDS for #1967 which might need more analysis.

@M00nF1sh
Copy link
Contributor

/lgtm
/approve

@M00nF1sh M00nF1sh self-requested a review August 15, 2022 22:53
@jayanthvn jayanthvn merged commit 99b14af into aws:master Aug 15, 2022
jayanthvn added a commit that referenced this pull request Aug 16, 2022
* 1.10.3 release artifacts (#1962)

* Stale PR and issue cleanup wrkflow (#1964)

* fix image name during build (#1968)

* add event recorder utils to raise aws-node pod events (#1536)

* refactor uploader scripts (#1972)

* Fix cni panic due to pod.Annotations is a nil map (#1974)

Co-authored-by: Relk Li <relk@maicoin.com>

* chart: Add extraVolumes and extraVolumeMounts (#1949)

Co-authored-by: Jayanth Varavani <1111446+jayanthvn@users.noreply.github.com>

* Add the new command in the section of CNI Plugin Sequence (#1813)

Co-authored-by: Jayanth Varavani <1111446+jayanthvn@users.noreply.github.com>

* Bump github.com/containernetworking/cni from 0.8.0 to 0.8.1 (#1966)

Bumps [github.com/containernetworking/cni](https://github.com/containernetworking/cni) from 0.8.0 to 0.8.1.
- [Release notes](https://github.com/containernetworking/cni/releases)
- [Commits](containernetworking/cni@v0.8.0...v0.8.1)

---
updated-dependencies:
- dependency-name: github.com/containernetworking/cni
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Jayanth Varavani <1111446+jayanthvn@users.noreply.github.com>

* Update README to highlight containerd.sock edge case with EKS AMI. (#1884)

* Update README to highlight containerd.sock edge case with EKS AMI.

* Updated Instructions as per review.

* add cni release test script (#1971)

* Multus release manifest (#1984)

* release manifest for Multus v3.8.0-eksbuild.1

* minor change to Readme

* Added Tests for validating Multus Installation (#1811)

* Added Tests for validating Multus Installation

Added missing files

Refactored code
Tried to make it modular and extensible.

* Deleted redundant file

* Fixed compilation issues

* fixed minor error

* Added script to trigger Multus tests (will be used by prow job)

* remove multus installation logic from ginkgo

* remove redundant changes

* Cleaned up run-multus-tests helper script

* Updated Readme for running multus tests
Added few checks in canary helper script

* revert changes to canary.sh

* Pass tag as an argument

* Updated Readme

* Updated tag for multus tests to use latest image

* Port new integration tests (#1928)

* Minor changes to run-integration-tests
Added integration-new framework tests

* Modified run-integration-tests to use new integration tests

* reverted redundant changes

* Merge integration with integration-new

* increase timeout (#1985)

fix syntax for ginkgo-v2

* Added configurable flag to create test nodes with arm64 and containerd runtime (#1977)

* Cleanup binary file (#1987)

* log error in ipamd on api server timeout (#1988)

* Refactored code and Added cni addon upgrade/downgrade regression test (#1861)

* Refactored code
Addon upgrade/downgrade test similar to #1795

Added tests for addon upgrade/downgrade

Changed DEFAULT version
Added addon status checks

Fetch latest addon version for given K8s Cluster

Update kops cluster config used in weekly tests (#1862)

* Change to kops cluster creation scripts

* Add logging for retry attempt

* Switch kops cluster to use docker container runtime

Co-authored-by: Jayanth Varavani <1111446+jayanthvn@users.noreply.github.com>

Renamed package name for adddon tests

removed unnecessary changes
Fixed replica count for MTU and Veth test in host networking

Updated ENI/IP limits file for newly added instances (#1864)

* Added new instances

* Updated test readme

* needed rebase

* formatting

* remove all references to integration-new
migrate to ginkgo v2 in addon test files

* fix maxIPPerInterface count on pod_networking_suite

* Increase default deployment ready timeout

Co-authored-by: Vikas Basavaraj <5373156+vikasmb@users.noreply.github.com>

* Remove generation of calico manifests (#1905)

* cni manifest upgrade downgrade test (#1863)

* Added upgrade/downgrade script template

Refactored code
Addon upgrade/downgrade test similar to #1795

Added tests for addon upgrade/downgrade

Changed DEFAULT version
Added addon status checks

Fetch latest addon version for given K8s Cluster

Update kops cluster config used in weekly tests (#1862)

* Change to kops cluster creation scripts

* Add logging for retry attempt

* Switch kops cluster to use docker container runtime

Co-authored-by: Jayanth Varavani <1111446+jayanthvn@users.noreply.github.com>

Added upgrade/downgrade test for custom cni-manifest-file

Added missing files

remove upgrade-downgrade.sh

* Add eks.go file , deleted by mistake

* Extract apply manifest logic in common
Remove redundant code

* Add PD traffic test for cni upgrade downgrade test

* Update golang to Go 1.18 (#1991)

* Update CNI Plugins to v1.1.1 (#1997)

* Update release manifests for VPC CNI v1.11.2 (#2001) (#2002)

* Enable Calico on ARM64 and add configureable flags for Calico installation (#2004)

* Enable Calico on ARM64 and add configureable flags for Calico
installation

* Add v to Calico version in release test script

* fix integration test script (#1998)

* Updated dependencies (#2012)

* Fix readme (#2013)

* Added upgrade/downgrade script template

Refactored code
Addon upgrade/downgrade test similar to #1795

Added tests for addon upgrade/downgrade

Changed DEFAULT version
Added addon status checks

Fetch latest addon version for given K8s Cluster

Update kops cluster config used in weekly tests (#1862)

* Change to kops cluster creation scripts

* Add logging for retry attempt

* Switch kops cluster to use docker container runtime

Co-authored-by: Jayanth Varavani <1111446+jayanthvn@users.noreply.github.com>

Added upgrade/downgrade test for custom cni-manifest-file

Added missing files

remove upgrade-downgrade.sh

* Add eks.go file , deleted by mistake

* Extract apply manifest logic in common
Remove redundant code

* Add PD traffic test for cni upgrade downgrade test

* Updated Readme

* Merge fix-ginkgo to master (#2014)

* fix path failure

* seperate makefile for test

Co-authored-by: abhipth <abhipth@amazon.com>

* Multus manifest for release v3.9.0-eksbuild.1 (#2016)

* Updating new instances - p4de (#2018)

* Updating new instances

* fix formatting

* Fix go build failure with v6 networking suite. (#2020)

* Update README.md (#2021)

* Fix Go build for ipamd test package. (#2023)

* Fix Go build for ipamd test package.

* Fix format with make format

* Fix go build for cni test package. (#2024)

* Prevent allocate/free ENIs when node is marked noSchedule (#1927)

* Prevent allocate/free ENIs when node is marked noSchedule

* Update UTs

* Re-use logger instance (#2029)

* Re-use logger instance

- Existing logger initialization constructed different logger
  instances upon call to Get() method.
- Fixed the initailiation logic to re-use the logger instance.

* Added unit tests for logger initialization fix

* fix addOn version api for beta (#2034)

* Update yaml.v3 package dependency (#2036)

* Update yaml.v3 package dependency

* Increase cpu requests limit (#2038)

- Porting changes from release-1.10 branch made in PR #1749

* fix ipamd integration failures and cleanup (#2039)

* fix integration test failures and cleanup

* README update

* cleanup info logs in event recorder and test script (#2043)

* add nodeSelector to cni-metrics-helper test deployment and update image tag (#2047)

* fix makefile path in canary test script (#2051)

* disable arm build (#2052)

* Updated changelog for 1.11.3 release (#2053)

Co-authored-by: Vikas Basavaraj Mallapura <“5373156+vikasmb@users.noreply.github.com”>

* Updating master branch config files to release 1.11.3 (#2055)

* Updated changelog for 1.11.3 release

* Image tag and chart version update for 1.11.3 release (#2050)

Co-authored-by: Vikas Basavaraj Mallapura <“5373156+vikasmb@users.noreply.github.com”>
(cherry picked from commit ff42a83)

Co-authored-by: Vikas Basavaraj Mallapura <“5373156+vikasmb@users.noreply.github.com”>

* update aws-node clusterrole permissions (#2058)

* Fix minor typo on documentation (#2059)

s/varibales/variables/

* multus manifest for release v3.9.0-eksbuild.2 (#2057)

* Setting AWS_VPC_K8S_CNI_RANDOMIZESNAT to the default value (#2028)

Co-authored-by: Jayanth Varavani <1111446+jayanthvn@users.noreply.github.com>

* Fixing prefixes per ENI value in example (#2060)

Prefixes per ENI in row 6 should be 1 not 3.

Co-authored-by: Jayanth Varavani <1111446+jayanthvn@users.noreply.github.com>

* IPAMD optimizations and makefile changes (#1975)

* IPAMD optimizations and makefile changes

* Minor comments

* Removed IMDS dependency

* fix test

* fix test

* fix test-format

* Updated new instances (#2062)

* Updated new instances

* fix format

Co-authored-by: M00nF1sh <yyyng@amazon.com>
Co-authored-by: Sushmitha Ravikumar <58063229+sushrk@users.noreply.github.com>
Co-authored-by: Relk Li <YiJiun.Li.C@gmail.com>
Co-authored-by: Relk Li <relk@maicoin.com>
Co-authored-by: Jan-Otto Kröpke <github@jkroepke.de>
Co-authored-by: Shuntaro Azuma <azush.work@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Senthil Kumaran <senthilx@amazon.com>
Co-authored-by: cgchinmay <cgadgil@amazon.com>
Co-authored-by: Vikas Basavaraj <5373156+vikasmb@users.noreply.github.com>
Co-authored-by: Hao Zhou <haouc@users.noreply.github.com>
Co-authored-by: abhipth <abhipth@amazon.com>
Co-authored-by: Prasad Jivane <prasad.jivane@walchandsangli.ac.in>
Co-authored-by: Vikas Basavaraj Mallapura <“5373156+vikasmb@users.noreply.github.com”>
Co-authored-by: Guillaume Delacour <guillaume.delacour@gmail.com>
Co-authored-by: Venkata Gunapati <gvsukumar@gmail.com>
Co-authored-by: Muhammed Karakas <karakas@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

WARM IP/MINIMUM IP greater than Max IPs per ENI delays adding ENI to datastore Makefile Build improvement
4 participants