
CNI does not wait for ENI and IP to be available before reporting ready when setting MINIMUM_IP_TARGET and WARM_IP_TARGET #1724

Closed
dtnyn opened this issue Nov 4, 2021 · 23 comments

@dtnyn

dtnyn commented Nov 4, 2021

What happened:

After setting MINIMUM_IP_TARGET and WARM_IP_TARGET, pods scheduled on a new node are stuck on "failed to assign an IP address to container" for a few minutes before they can be created.

From my observation, after setting these overrides, aws-node does not wait for its datastore to be populated before reporting ready, so pods are scheduled onto these nodes and go into a crashing state because no IP can be assigned.
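
For reference, the overrides in question were applied to the aws-node daemonset roughly as follows (a sketch only; the exact values we run with are shown further down in this thread):

# Sketch: apply the IP warm-pool overrides to the aws-node daemonset.
# Values match the config shared later in this thread; adjust for your cluster.
kubectl -n kube-system set env daemonset/aws-node \
  WARM_IP_TARGET=5 \
  MINIMUM_IP_TARGET=25
# Changing the pod template triggers a rolling restart of aws-node.
kubectl -n kube-system rollout status daemonset/aws-node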

Pod log

Normal   SuccessfulAttachVolume  3m16s              attachdetach-controller  AttachVolume.Attach succeeded for volume "XXXXXXXX"
  Warning  FailedCreatePodSandBox  3m12s              kubelet                  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "22cbd7067d76c2ed509436f65384e1df59ec8949e7910d7bbbfb32de8a511f89": add cmd: failed to assign an IP address to container
  Warning  FailedCreatePodSandBox  3m                 kubelet                  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "81a9fde8b5ea0da28959e8c08e08dc2ffcbe700c7f3067afd273e9ac67b37588": add cmd: failed to assign an IP address to container
  Warning  FailedCreatePodSandBox  2m48s              kubelet                  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "a94f47242b45d3610a9bf25ee17de09d8fb6f77dcf1754deac13395020b1082f": add cmd: failed to assign an IP address to container
  Warning  FailedCreatePodSandBox  2m37s              kubelet                  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "a6f2f67496e2c75cdc89963afc5d4f0b47fe3d6988d0356b4cd0e0ffb641400c": add cmd: failed to assign an IP address to container
  Warning  FailedCreatePodSandBox  2m24s              kubelet                  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "de2c20851923cc290e187ade06472c2f77595c5d55f7ee7522603fb55e9ee943": add cmd: failed to assign an IP address to container
  Warning  FailedCreatePodSandBox  2m10s              kubelet                  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "4f3e15dbff1e9f6e2ad7055a61b75724d9ccd60850e05fe811966ca1cae65529": add cmd: failed to assign an IP address to container
  Warning  FailedCreatePodSandBox  115s               kubelet                  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "a545894e2066ed715d5b3667c38bf9f6ecefd45c67e79da2e8016929e84c66a0": add cmd: failed to assign an IP address to container
  Warning  FailedCreatePodSandBox  100s               kubelet                  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "9e76dd6f08280e3f246e1f366b8b0170abc345e814754797cea56e523e2f14c1": add cmd: failed to assign an IP address to container
  Warning  FailedCreatePodSandBox  88s                kubelet                  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "5c6a3409855d54fd5f296e0793eafc84202a4e825a2cea86784b54bfb3220231": add cmd: failed to assign an IP address to container
  Warning  FailedCreatePodSandBox  75s                kubelet                  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "0be78bc1e43e44a0f14a503c9cfbb0fdd9570d5aab0a6db9244e129c3ef294a9": add cmd: failed to assign an IP address to container
  Normal   Pulling                 62s                kubelet                  Pulling image "XXXXXXXXXXX"

I have been monitoring the pod's ready state and can see that it is declared ready before ipamd has successfully added the ENI and its IPs to the datastore.

The pod status change

Thu Nov  4 11:43:31 AEDT 2021
aws-node-xnpcf                                                          0/1   Running             0     69s   10.1.56.83    ip-10-1-56-83.ap-southeast-2.compute.internal    <none>   <none>
Thu Nov  4 11:43:32 AEDT 2021
aws-node-xnpcf                                                          0/1   Running             0     70s   10.1.56.83    ip-10-1-56-83.ap-southeast-2.compute.internal    <none>   <none>
Thu Nov  4 11:43:33 AEDT 2021
aws-node-xnpcf                                                          1/1   Running             0     72s   10.1.56.83    ip-10-1-56-83.ap-southeast-2.compute.internal    <none>   <none>
Thu Nov  4 11:43:35 AEDT 2021
aws-node-xnpcf                                                          1/1   Running             0     73s   10.1.56.83    ip-10-1-56-83.ap-southeast-2.compute.internal    <none>   <none>

Readiness

date; k describe po aws-node-xnpcf | ag Ready:
Thu Nov  4 11:43:43 AEDT 2021
    Ready:          True
    Ready:          True
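
(For reference, snapshots like the above can be captured with a simple polling loop; a rough sketch using plain kubectl and grep in place of my aliases:)

# Sketch: poll the aws-node pod's status and readiness once a second.
while true; do
  date
  kubectl -n kube-system get pod aws-node-xnpcf -o wide
  kubectl -n kube-system describe pod aws-node-xnpcf | grep 'Ready:'
  sleep 1
done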

The node status change

ip-10-1-56-83.ap-southeast-2.compute.internal   NotReady   <none>   38s   v1.20.11
Thu Nov  4 11:43:01 AEDT 2021
ip-10-1-56-83.ap-southeast-2.compute.internal   NotReady   <none>   39s   v1.20.11
Thu Nov  4 11:43:02 AEDT 2021
ip-10-1-56-83.ap-southeast-2.compute.internal   Ready   <none>   41s   v1.20.11
Thu Nov  4 11:43:03 AEDT 2021
ip-10-1-56-83.ap-southeast-2.compute.internal   Ready   <none>   42s   v1.20.11

Inside ipamd, around the time the pod and node are declared ready, I can see pods trying to be scheduled and assigned an IP, but there is nothing in the datastore at that point:

{"level":"info","ts":"2021-11-04T00:43:35.680Z","caller":"rpc/rpc.pb.go:501","msg":"Received AddNetwork for NS /var/run/netns/cni-42c1a3c7-8c69-e278-754d-efb63b6b33ea, Sandbox b624815ee518e511c3c99bdc8e66dfd1c996fe32986fadf90d9bd5235594e901, ifname eth0"}
{"level":"debug","ts":"2021-11-04T00:43:35.680Z","caller":"rpc/rpc.pb.go:501","msg":"AddNetworkRequest: ClientVersion:\"v1.9.0\" K8S_POD_NAME:\"xxxxxxxxxxx\" K8S_POD_NAMESPACE:\"xxxxxxxx\" K8S_POD_INFRA_CONTAINER_ID:\"b624815ee518e511c3c99bdc8e66dfd1c996fe32986fadf90d9bd5235594e901\" ContainerID:\"b624815ee518e511c3c99bdc8e66dfd1c996fe32986fadf90d9bd5235594e901\" IfName:\"eth0\" NetworkName:\"aws-cni\" Netns:\"/var/run/netns/cni-42c1a3c7-8c69-e278-754d-efb63b6b33ea\" "}
{"level":"debug","ts":"2021-11-04T00:43:35.680Z","caller":"ipamd/rpc_handler.go:141","msg":"AssignIPv4Address: IP address pool stats: total: 0, assigned 0"}
{"level":"debug","ts":"2021-11-04T00:43:35.680Z","caller":"ipamd/rpc_handler.go:141","msg":"AssignPodIPv4Address: ENI eni-0f7e7018aaead6bc1 does not have available addresses"}
{"level":"error","ts":"2021-11-04T00:43:35.680Z","caller":"ipamd/rpc_handler.go:141","msg":"DataStore has no available IP/Prefix addresses"}

It is not until a while later that ipamd successfully populates the datastore:

{"level":"info","ts":"2021-11-04T00:45:49.425Z","caller":"ipamd/ipamd.go:917","msg":"Added ENI(eni-06129a67d2970e562)'s IP/Prefix 10.7.31.56/32 to datastore"}
{"level":"debug","ts":"2021-11-04T00:45:49.426Z","caller":"ipamd/ipamd.go:904","msg":"Datastore Pool stats: total(/32): 14, assigned(/32): 0, totalprefixes(/28): 0"}
{"level":"debug","ts":"2021-11-04T00:45:49.426Z","caller":"ipamd/ipamd.go:905","msg":"Datastore Pool stats: total(/32): 14, assigned(/32): 0, totalprefixes(/28): 0"}
{"level":"debug","ts":"2021-11-04T00:45:49.426Z","caller":"ipamd/ipamd.go:706","msg":"Successfully increased IP pool, total: 14, used: 0"}
{"level":"debug","ts":"2021-11-04T00:45:49.426Z","caller":"ipamd/ipamd.go:723","msg":"IP pool stats: total = 14, used = 0, c.maxIPsPerENI = 14"}

During this window between aws-node reporting ready and the ipamd datastore actually being populated with IPs, pods are scheduled onto the node and fail, which is undesirable.

What you expected to happen:

I expected behaviour similar to that observed with the default WARM_ENI_TARGET=1, where the pod is not considered ready until the datastore is populated with the attached ENI's IPs.

The following logs are from using the default configuration with no MINIMUM_IP_TARGET or WARM_IP_TARGET override. In this case aws-cni works as expected.

Pod status

Thu Nov  4 15:55:28 AEDT 2021
aws-node-7xwp2                                                          0/1   Running             0     69s     10.1.56.123   ip-10-1-56-123.ap-southeast-2.compute.internal   <none>   <none>
Thu Nov  4 15:55:29 AEDT 2021
aws-node-7xwp2                                                          1/1   Running             0     70s     10.1.56.123   ip-10-1-56-123.ap-southeast-2.compute.internal   <none>   <none>
Thu Nov  4 15:55:31 AEDT 2021
aws-node-7xwp2                                                          1/1   Running             0     72s     10.1.56.123   ip-10-1-56-123.ap-southeast-2.compute.internal   <none>   <none>

Ready status

Thu Nov  4 15:55:26 AEDT 2021
    Ready:          True
    Ready:          False
Thu Nov  4 15:55:27 AEDT 2021
    Ready:          True
    Ready:          False
Thu Nov  4 15:55:29 AEDT 2021
    Ready:          True
    Ready:          True

I can see in the ipamd log that the datastore successfully populates itself with the ENI's IPs around this time:

{"level":"info","ts":"2021-11-04T04:55:02.623Z","caller":"ipamd/ipamd.go:917","msg":"Added ENI(eni-035e6725b4d13e122)'s IP/Prefix 10.7.30.221/32 to datastore"}
{"level":"debug","ts":"2021-11-04T04:55:02.623Z","caller":"ipamd/ipamd.go:904","msg":"Datastore Pool stats: total(/32): 14, assigned(/32): 0, total prefixes(/28): 0"}
{"level":"debug","ts":"2021-11-04T04:55:02.623Z","caller":"ipamd/ipamd.go:905","msg":"Datastore Pool stats: total(/32): 14, assigned(/32): 0, total prefixes(/28): 0"}
{"level":"debug","ts":"2021-11-04T04:55:02.623Z","caller":"ipamd/ipamd.go:706","msg":"Successfully increased IP pool, total: 14, used: 0"}
{"level":"debug","ts":"2021-11-04T04:55:02.623Z","caller":"ipamd/ipamd.go:723","msg":"IP pool stats: total = 14, used = 0, c.maxIPsPerENI = 14"}

Environment:

  • Kubernetes version (use kubectl version): v1.20.11
  • CNI Version: v1.9.0
  • OS (e.g: cat /etc/os-release): Ubuntu 20.04.3 LTS (Focal Fossa)
  • Kernel (e.g. uname -a): Linux xxxxxxxx 5.11.0-1019-aws #20~20.04.1-Ubuntu SMP Tue Sep 21 10:40:39 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
@dtnyn dtnyn added the bug label Nov 4, 2021
@jayanthvn
Contributor

jayanthvn commented Nov 5, 2021

@dtnyn - What are your WARM_IP_TARGET and MINIMUM_IP_TARGET values? Do you have custom networking enabled?

With respect to WARM_IP_TARGET/MINIMUM_IP_TARGET versus WARM_ENI_TARGET, there is no difference in how IPAMD starts up. We wait for ipamd to start (https://github.com/aws/amazon-vpc-cni-k8s/blob/release-1.9/scripts/entrypoint.sh#L139) before copying the CNI config file (https://github.com/aws/amazon-vpc-cni-k8s/blob/release-1.9/scripts/entrypoint.sh#L145), and in ipamd init we try to assign IPs to the primary ENI (https://github.com/aws/amazon-vpc-cni-k8s/blob/v1.9.0/pkg/ipamd/ipamd.go#L493). So, in non-custom-networking mode, the primary ENI should typically have IPs before the node is ready.
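
The ordering is roughly the following (a simplified sketch of the entrypoint logic, not the verbatim script):

# Simplified sketch of the entrypoint ordering (not the verbatim script):
# 1) Block until ipamd answers its gRPC health check...
until /app/grpc-health-probe -addr=127.0.0.1:50051; do
  sleep 1
done
# 2) ...and only then install the CNI config; kubelet marks the node Ready
#    once a CNI config is present in /etc/cni/net.d.
cp /app/10-aws.conflist /host/etc/cni/net.d/10-aws.conflist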

Once you share the configs, we can try to verify this scenario locally.

@dtnyn
Author

dtnyn commented Nov 5, 2021

This is the configuration enabled for the CNI:

Liveness:   exec [/app/grpc-health-probe -addr=:50051] delay=35s timeout=1s period=10s
Readiness:  exec [/app/grpc-health-probe -addr=:50051] delay=35s timeout=1s period=10s
Environment:
  AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG:  true
  AWS_VPC_K8S_CNI_EXTERNALSNAT:        true
  ENI_CONFIG_LABEL_DEF:                awscni
  AWS_VPC_K8S_CNI_VETHPREFIX:          eni
  AWS_VPC_K8S_CNI_LOGLEVEL:            DEBUG
  AWS_VPC_K8S_PLUGIN_LOG_LEVEL:        DEBUG
  WARM_IP_TARGET:                      5
  MINIMUM_IP_TARGET:                   25
  MY_NODE_NAME:                         (v1:spec.nodeName)
  WATCH_NAMESPACE:                     xxxxxx (v1:metadata.namespace)

This is used on both m5.xlarge and m5a.xlarge; the difference in behaviour compared to the default WARM_ENI_TARGET can be observed on both.
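
(For completeness: with ENI_CONFIG_LABEL_DEF=awscni as above, the ENIConfig each node uses is selected from its awscni node label. A quick way to check, with the node name as a placeholder:)

# Sketch: see which ENIConfig custom networking will use for a node.
NODE=ip-10-1-56-83.ap-southeast-2.compute.internal   # placeholder node name
# ENI_CONFIG_LABEL_DEF=awscni means the ENIConfig name comes from this label:
kubectl get node "$NODE" -o jsonpath='{.metadata.labels.awscni}'
# List the ENIConfig resources defined in the cluster:
kubectl get eniconfigs.crd.k8s.amazonaws.com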

@jayanthvn
Contributor

Thanks for sharing the details. Since you mentioned this is a new node, the instance will have only the primary ENI attached, and during aws-node initialization IPs won't be assigned to the primary ENI because custom networking is enabled. The secondary ENI will be allocated by the reconciler based on the WARM target requirements, since available IPs will be 0. Before the reconciler thread starts, the node can be marked as ready while the datastore has no IPs. We will try both configs locally and get back to you.

@dtnyn
Author

dtnyn commented Nov 11, 2021

Maybe a better description of the problem is the difference between the init time and the time when the datastore is first populated with IPs. We're seeing an average wait of up to 3 minutes when the WARM_IP_TARGET override is enabled, compared to the average 30-60 seconds with the default WARM_ENI_TARGET=1 on m5.xlarge.

During those 3 minutes of waiting, pods are stuck pending for an extended period since there are no IPs to allocate to them. I do sometimes see pods waiting with the default WARM_ENI_TARGET configuration too, but since the instantiation time is so fast, it's negligible.
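
(For reference, the "short: 11" in the logs below is consistent with the warm-pool arithmetic, assuming the documented semantics of the two targets; a sketch:)

# Sketch: how "short: 11" falls out of the targets in the logs below.
total=14; assigned=0
warm_ip_target=5; minimum_ip_target=25
short_warm=$(( warm_ip_target + assigned - total ))   # 5 + 0 - 14 = -9
short_min=$(( minimum_ip_target - total ))            # 25 - 14 = 11
short=$(( short_warm > short_min ? short_warm : short_min ))
short=$(( short > 0 ? short : 0 ))                    # => 11
echo "short: $short"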

Examples:

WARM_IP_TARGET=5 MINIMUM_IP_TARGET=25

Node 1:

{"level":"info","ts":"2021-11-11T05:40:03.932Z","caller":"aws-k8s-agent/main.go:28","msg":"Starting L-IPAMD v1.9.0  ..."}
{"level":"info","ts":"2021-11-11T05:40:03.934Z","caller":"aws-k8s-agent/main.go:42","msg":"Testing communication with server"}
{"level":"debug","ts":"2021-11-11T05:43:04.337Z","caller":"ipamd/ipamd.go:706","msg":"Successfully increased IP pool, total: 14, used: 0"}
{"level":"debug","ts":"2021-11-11T05:43:04.337Z","caller":"ipamd/ipamd.go:723","msg":"IP pool stats: total = 14, used = 0, c.maxIPsPerENI = 14"}
{"level":"debug","ts":"2021-11-11T05:43:04.337Z","caller":"ipamd/ipamd.go:1039","msg":"Current warm IP stats: target: 5, total: 14, assigned: 0, available: 14, short: 11, over 0"}

Node 2:

{"level":"info","ts":"2021-11-11T05:40:02.882Z","caller":"aws-k8s-agent/main.go:28","msg":"Starting L-IPAMD v1.9.0  ..."}
{"level":"info","ts":"2021-11-11T05:40:02.883Z","caller":"aws-k8s-agent/main.go:42","msg":"Testing communication with server"}
{"level":"debug","ts":"2021-11-11T05:43:06.767Z","caller":"ipamd/ipamd.go:706","msg":"Successfully increased IP pool, total: 14, used: 0"}
{"level":"debug","ts":"2021-11-11T05:43:06.767Z","caller":"ipamd/ipamd.go:723","msg":"IP pool stats: total = 14, used = 0, c.maxIPsPerENI = 14"}
{"level":"debug","ts":"2021-11-11T05:43:06.767Z","caller":"ipamd/ipamd.go:1039","msg":"Current warm IP stats: target: 5, total: 14, assigned: 0, available: 14, short: 11, over 0"}

Node 3:

{"level":"info","ts":"2021-11-11T05:40:03.974Z","caller":"aws-k8s-agent/main.go:28","msg":"Starting L-IPAMD v1.9.0  ..."}
{"level":"info","ts":"2021-11-11T05:40:03.975Z","caller":"aws-k8s-agent/main.go:42","msg":"Testing communication with server"}
{"level":"debug","ts":"2021-11-11T05:43:05.236Z","caller":"ipamd/ipamd.go:706","msg":"Successfully increased IP pool, total: 14, used: 0"}
{"level":"debug","ts":"2021-11-11T05:43:05.236Z","caller":"ipamd/ipamd.go:723","msg":"IP pool stats: total = 14, used = 0, c.maxIPsPerENI = 14"}
{"level":"debug","ts":"2021-11-11T05:43:05.236Z","caller":"ipamd/ipamd.go:1039","msg":"Current warm IP stats: target: 5, total: 14, assigned: 0, available: 14, short: 11, over 0"}

WARM_ENI_TARGET=1 (Default)

Node 1:

{"level":"info","ts":"2021-11-11T06:12:10.002Z","caller":"aws-k8s-agent/main.go:28","msg":"Starting L-IPAMD v1.9.0  ..."}
{"level":"info","ts":"2021-11-11T06:12:10.003Z","caller":"aws-k8s-agent/main.go:42","msg":"Testing communication with server"}
{"level":"debug","ts":"2021-11-11T06:12:21.139Z","caller":"ipamd/ipamd.go:706","msg":"Successfully increased IP pool, total: 14, used: 0"}
{"level":"debug","ts":"2021-11-11T06:12:21.139Z","caller":"ipamd/ipamd.go:723","msg":"IP pool stats: total = 14, used = 0, c.maxIPsPerENI = 14"}
{"level":"info","ts":"2021-11-11T06:12:22.510Z","caller":"rpc/rpc.pb.go:501","msg":"Received AddNetwork for NS /var/run/netns/cni-4b0995b0-dbd8-f4f9-1e56-14919ad2318a, Sandbox 82ea78f03a231293bf67c0d28b7219378089f644c4d8e69ac052c21dc2f942a7, ifname eth0"}
{"level":"debug","ts":"2021-11-11T06:12:22.511Z","caller":"rpc/rpc.pb.go:501","msg":"AddNetworkRequest: ClientVersion:\"v1.9.0\" K8S_POD_NAME:\"xxxxxxx\" K8S_POD_NAMESPACE:\"xxxxxxx\" K8S_POD_INFRA_CONTAINER_ID:\"82ea78f03a231293bf67c0d28b7219378089f644c4d8e69ac052c21dc2f942a7\" ContainerID:\"82ea78f03a231293bf67c0d28b7219378089f644c4d8e69ac052c21dc2f942a7\" IfName:\"eth0\" NetworkName:\"aws-cni\" Netns:\"/var/run/netns/cni-4b0995b0-dbd8-f4f9-1e56-14919ad2318a\" "}
{"level":"debug","ts":"2021-11-11T06:12:22.511Z","caller":"ipamd/rpc_handler.go:141","msg":"AssignIPv4Address: IP address pool stats: total: 14, assigned 0"}

Node 2:

{"level":"info","ts":"2021-11-11T06:12:09.881Z","caller":"aws-k8s-agent/main.go:28","msg":"Starting L-IPAMD v1.9.0  ..."}
{"level":"info","ts":"2021-11-11T06:12:09.882Z","caller":"aws-k8s-agent/main.go:42","msg":"Testing communication with server"}
{"level":"debug","ts":"2021-11-11T06:12:20.717Z","caller":"ipamd/ipamd.go:706","msg":"Successfully increased IP pool, total: 14, used: 0"}
{"level":"debug","ts":"2021-11-11T06:12:20.717Z","caller":"ipamd/ipamd.go:723","msg":"IP pool stats: total = 14, used = 0, c.maxIPsPerENI = 14"}
{"level":"info","ts":"2021-11-11T06:12:22.210Z","caller":"rpc/rpc.pb.go:501","msg":"Received AddNetwork for NS /var/run/netns/cni-57a72a54-6f0e-7d8a-c423-bc826298d8fa, Sandbox 330d515564bd9e0382c8b786449771e67b27bb58fccb7030f5ed5b33ca55edb9, ifname eth0"}
{"level":"debug","ts":"2021-11-11T06:12:22.210Z","caller":"rpc/rpc.pb.go:501","msg":"AddNetworkRequest: ClientVersion:\"v1.9.0\" K8S_POD_NAME:\"xxxxxxx\" K8S_POD_NAMESPACE:\"xxxxxxx\" K8S_POD_INFRA_CONTAINER_ID:\"330d515564bd9e0382c8b786449771e67b27bb58fccb7030f5ed5b33ca55edb9\" ContainerID:\"330d515564bd9e0382c8b786449771e67b27bb58fccb7030f5ed5b33ca55edb9\" IfName:\"eth0\" NetworkName:\"aws-cni\" Netns:\"/var/run/netns/cni-57a72a54-6f0e-7d8a-c423-bc826298d8fa\" "}
{"level":"debug","ts":"2021-11-11T06:12:22.210Z","caller":"ipamd/rpc_handler.go:141","msg":"AssignIPv4Address: IP address pool stats: total: 14, assigned 0"}

Node 3:

{"level":"info","ts":"2021-11-11T06:12:10.108Z","caller":"aws-k8s-agent/main.go:28","msg":"Starting L-IPAMD v1.9.0  ..."}
{"level":"info","ts":"2021-11-11T06:12:10.109Z","caller":"aws-k8s-agent/main.go:42","msg":"Testing communication with server"}
{"level":"debug","ts":"2021-11-11T06:12:24.631Z","caller":"ipamd/ipamd.go:706","msg":"Successfully increased IP pool, total: 14, used: 0"}
{"level":"debug","ts":"2021-11-11T06:12:24.631Z","caller":"ipamd/ipamd.go:723","msg":"IP pool stats: total = 14, used = 0, c.maxIPsPerENI = 14"}
{"level":"info","ts":"2021-11-11T06:12:34.864Z","caller":"rpc/rpc.pb.go:501","msg":"Received AddNetwork for NS /var/run/netns/cni-285be1d5-8ad2-7a0f-29f3-0d1c91a6efb0, Sandbox 1ece564a1c9ea4e7c311950efda30949538847f30a98eabf90fa3651c3b6adfd, ifname eth0"}
{"level":"debug","ts":"2021-11-11T06:12:34.864Z","caller":"rpc/rpc.pb.go:501","msg":"AddNetworkRequest: ClientVersion:\"v1.9.0\" K8S_POD_NAME:\"xxxxxxx\" K8S_POD_NAMESPACE:\"xxxxxxx\" K8S_POD_INFRA_CONTAINER_ID:\"1ece564a1c9ea4e7c311950efda30949538847f30a98eabf90fa3651c3b6adfd\" ContainerID:\"1ece564a1c9ea4e7c311950efda30949538847f30a98eabf90fa3651c3b6adfd\" IfName:\"eth0\" NetworkName:\"aws-cni\" Netns:\"/var/run/netns/cni-285be1d5-8ad2-7a0f-29f3-0d1c91a6efb0\" "}
{"level":"debug","ts":"2021-11-11T06:12:34.864Z","caller":"ipamd/rpc_handler.go:141","msg":"AssignIPv4Address: IP address pool stats: total: 14, assigned 0"}

@sidewinder12s

@dtnyn Have you given the new IP prefixes a try? Do they have a similar delay? I suspect I might be experiencing similar delays, as we have the same type of config in our clusters.

@jayanthvn
Contributor

I did a test on my local cluster to compare the time difference between init time and when the datastore has IPs, and didn't find any difference between the configs:

WARM_ENI_TARGET = 1

Init start ->

{"level":"info","ts":"2022-01-26T06:30:00.329Z","caller":"aws-k8s-agent/main.go:28","msg":"Starting L-IPAMD v1.9.3  ..."}

Reconciler increasing the pool and added a secondary ENI ->

{"level":"debug","ts":"2022-01-26T06:30:05.052Z","caller":"ipamd/ipamd.go:628","msg":"IP pool is too low: available (0) < ENI target (1) * addrsPerENI (3)"}
{"level":"debug","ts":"2022-01-26T06:30:05.052Z","caller":"ipamd/ipamd.go:629","msg":"Starting to increase pool size"}
{"level":"debug","ts":"2022-01-26T06:30:05.052Z","caller":"ipamd/ipamd.go:899","msg":"Skip the primary ENI for need IP check"}
{"level":"info","ts":"2022-01-26T06:30:05.052Z","caller":"eniconfig/eniconfig.go:68","msg":"Get Node Info for: ip-192-168-30-100.us-west-2.compute.internal"}
{"level":"debug","ts":"2022-01-26T06:30:05.052Z","caller":"eniconfig/eniconfig.go:129","msg":"Using ENI_CONFIG_LABEL_DEF failure-domain.beta.kubernetes.io/zone"}
{"level":"info","ts":"2022-01-26T06:30:05.052Z","caller":"ipamd/ipamd.go:806","msg":"Found ENI Config Name: us-west-2a"}
{"level":"info","ts":"2022-01-26T06:30:05.153Z","caller":"ipamd/ipamd.go:780","msg":"ipamd: using custom network config: [], subnet-0eeb2ea845de1c8d3"}
{"level":"info","ts":"2022-01-26T06:30:05.153Z","caller":"awsutils/awsutils.go:689","msg":"Using a custom network config for the new ENI"}
{"level":"warn","ts":"2022-01-26T06:30:05.153Z","caller":"awsutils/awsutils.go:689","msg":"No custom networking security group found, will use the node's primary ENI's SG: [%!s(*string=0xc0006cd760) %!s(*string=0xc0006cd770)]"}
{"level":"info","ts":"2022-01-26T06:30:05.153Z","caller":"awsutils/awsutils.go:689","msg":"Creating ENI with security groups: [sg-04123967cfc7fea84 sg-0c72b30626f1cbaf3] in subnet: subnet-0eeb2ea845de1c8d3"}
{"level":"info","ts":"2022-01-26T06:30:05.608Z","caller":"awsutils/awsutils.go:689","msg":"Created a new ENI: eni-07c415ca2e05f9b9d"}
{"level":"debug","ts":"2022-01-26T06:30:05.709Z","caller":"awsutils/awsutils.go:732","msg":"Discovered device number is used: 0"}
{"level":"debug","ts":"2022-01-26T06:30:05.709Z","caller":"awsutils/awsutils.go:732","msg":"Found a free device number: 1"}
{"level":"info","ts":"2022-01-26T06:30:07.040Z","caller":"ipamd/ipamd.go:821","msg":"Successfully created and attached a new ENI eni-07c415ca2e05f9b9d to instance"}
{"level":"info","ts":"2022-01-26T06:30:07.040Z","caller":"ipamd/ipamd.go:830","msg":"Trying to allocate 3 IP addresses on ENI eni-07c415ca2e05f9b9d"}
{"level":"debug","ts":"2022-01-26T06:30:07.040Z","caller":"ipamd/ipamd.go:830","msg":"PD enabled - false"}
{"level":"info","ts":"2022-01-26T06:30:07.535Z","caller":"ipamd/ipamd.go:830","msg":"Allocated 3 private IP addresses"}

WARM_IP_TARGET = 1 and MINIMUM_IP_TARGET = 3

Init start ->

{"level":"info","ts":"2022-01-26T06:57:17.253Z","caller":"aws-k8s-agent/main.go:28","msg":"Starting L-IPAMD v1.9.3  ..."}

Reconciler increasing the pool and added a secondary ENI ->

{"level":"debug","ts":"2022-01-26T06:57:21.208Z","caller":"ipamd/ipamd.go:629","msg":"Starting to increase pool size"}
{"level":"debug","ts":"2022-01-26T06:57:21.208Z","caller":"ipamd/ipamd.go:737","msg":"Current warm IP stats: target: 1, total: 0, assigned: 0, available: 0, short: 3, over 0"}
{"level":"debug","ts":"2022-01-26T06:57:21.208Z","caller":"ipamd/ipamd.go:862","msg":"Current warm IP stats: target: 1, total: 0, assigned: 0, available: 0, short: 3, over 0"}
{"level":"debug","ts":"2022-01-26T06:57:21.208Z","caller":"ipamd/ipamd.go:886","msg":"Current warm IP stats: target: 1, total: 0, assigned: 0, available: 0, short: 3, over 0"}
{"level":"debug","ts":"2022-01-26T06:57:21.208Z","caller":"ipamd/ipamd.go:899","msg":"Skip the primary ENI for need IP check"}
{"level":"info","ts":"2022-01-26T06:57:21.208Z","caller":"eniconfig/eniconfig.go:68","msg":"Get Node Info for: ip-192-168-33-140.us-west-2.compute.internal"}
{"level":"debug","ts":"2022-01-26T06:57:21.209Z","caller":"eniconfig/eniconfig.go:129","msg":"Using ENI_CONFIG_LABEL_DEF failure-domain.beta.kubernetes.io/zone"}
{"level":"info","ts":"2022-01-26T06:57:21.209Z","caller":"ipamd/ipamd.go:806","msg":"Found ENI Config Name: us-west-2b"}
{"level":"info","ts":"2022-01-26T06:57:21.309Z","caller":"ipamd/ipamd.go:780","msg":"ipamd: using custom network config: [], subnet-0747217e1d86e6510"}
{"level":"info","ts":"2022-01-26T06:57:21.309Z","caller":"awsutils/awsutils.go:689","msg":"Using a custom network config for the new ENI"}
{"level":"warn","ts":"2022-01-26T06:57:21.309Z","caller":"awsutils/awsutils.go:689","msg":"No custom networking security group found, will use the node's primary ENI's SG: [%!s(*string=0xc00024c5e0) %!s(*string=0xc00024c5f0)]"}
{"level":"info","ts":"2022-01-26T06:57:21.309Z","caller":"awsutils/awsutils.go:689","msg":"Creating ENI with security groups: [sg-04123967cfc7fea84 sg-0c72b30626f1cbaf3] in subnet: subnet-0747217e1d86e6510"}
{"level":"info","ts":"2022-01-26T06:57:21.723Z","caller":"awsutils/awsutils.go:689","msg":"Created a new ENI: eni-052e27ef0393de5ca"}
{"level":"debug","ts":"2022-01-26T06:57:21.824Z","caller":"awsutils/awsutils.go:732","msg":"Discovered device number is used: 0"}
{"level":"debug","ts":"2022-01-26T06:57:21.824Z","caller":"awsutils/awsutils.go:732","msg":"Found a free device number: 1"}
{"level":"info","ts":"2022-01-26T06:57:23.135Z","caller":"ipamd/ipamd.go:821","msg":"Successfully created and attached a new ENI eni-052e27ef0393de5ca to instance"}
{"level":"debug","ts":"2022-01-26T06:57:23.136Z","caller":"ipamd/ipamd.go:1899","msg":"Current warm IP stats: target: 1, total: 0, assigned: 0, available: 0, short: 3, over 0"}
{"level":"info","ts":"2022-01-26T06:57:23.136Z","caller":"ipamd/ipamd.go:830","msg":"Trying to allocate 3 IP addresses on ENI eni-052e27ef0393de5ca"}
{"level":"debug","ts":"2022-01-26T06:57:23.136Z","caller":"ipamd/ipamd.go:830","msg":"PD enabled - false"}
{"level":"info","ts":"2022-01-26T06:57:23.554Z","caller":"ipamd/ipamd.go:830","msg":"Allocated 3 private IP addresses"}

In both cases the reconciler tried to increase the pool within 4-5 seconds. There is no difference in the code path between the two configurations: once the reconciler starts, the pool is increased based on the WARM targets.

But I feel there is one optimization that can be done here: since you are using custom networking and the primary ENI cannot have IPs/prefixes attached, we should go ahead and attach the secondary ENI based on the WARM targets during initialization. That should help with pod launch time; otherwise we have to wait for the reconciler to increase the pool after the node has already been marked as ready.

@dtnyn
Author

dtnyn commented Jan 27, 2022

@sidewinder12s We have tried prefix mode but did not do extended testing, due to the minimum allocation size of /28, which doesn't fit our use case.

@jayanthvn I believe your proposed optimization, attaching a new ENI with pre-allocated IPs, would be best for our current problem, since we can observe that both the ENI attachment and the start of the reconciler are very fast. Only the initial reconciler pass that adds new IPs to the ENI is very slow; any subsequent reconciler passes on the same ENI are fast.

@jayanthvn
Contributor

@dtnyn - Yes, that's right. I have the PR and am testing it out locally.

@github-actions

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

@github-actions github-actions bot added the stale Issue or PR is stale label Apr 29, 2022
@CameronHall

@jayanthvn we're experiencing this issue as well. How has your local testing been going?

@jayanthvn
Contributor

I am waiting for the code review.

@github-actions github-actions bot removed the stale Issue or PR is stale label May 6, 2022
@leofernandezg

+1

@sidewinder12s

Any updates on PR Progress?

@jayanthvn
Contributor

We had discussed a few changes for the PR but couldn't make progress because of other commitments. I will take it up next week for changes/review and track it for the 1.11.5 or 1.12.0 release. I will provide the release timelines once the PR is merged.

@github-actions

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

@github-actions github-actions bot added the stale Issue or PR is stale label Nov 12, 2022
@sidewinder12s

Not stale. We still regularly see this issue.

@github-actions github-actions bot removed the stale Issue or PR is stale label Nov 15, 2022
@github-actions

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

@github-actions github-actions bot added the stale Issue or PR is stale label Jan 15, 2023
@sidewinder12s

sidewinder12s commented Jan 16, 2023

Not stale

@github-actions github-actions bot removed the stale Issue or PR is stale label Jan 17, 2023
@github-actions

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

@github-actions github-actions bot added the stale Issue or PR is stale label Mar 19, 2023
@sidewinder12s

Not stale.

@github-actions github-actions bot removed the stale Issue or PR is stale label Mar 21, 2023
@pmvrak

pmvrak commented Mar 21, 2023

I faced the same issue while creating an EKS cluster with custom networking and prefix delegation, and I got the error below:

{"level":"error","ts":"2023-03-21T16:32:26.571Z","caller":"ipamd/ipamd.go:845","msg":"Failed to increase pool size due to not able to allocate ENI AllocENI:error attaching ENI: attachENI:

failed to attach ENI: InvalidParameterCombination: You may not attach a network interface to an instance if they are not in the same availability zone\n\tstatus code: 400, request id: 8160c309-fa43-4b0f-95f2-45b3b5100ed7"}

{"level":"debug","ts":"2023-03-21T17:26:17.714Z","caller":"datastore/data_store.go:646","msg":"AssignIPv4Address: IP address pool stats: total: 0, assigned 0"}

{"level":"debug","ts":"2023-03-21T17:26:17.714Z","caller":"datastore/data_store.go:646","msg":"AssignPodIPv4Address: ENI eni-0b881d8fd9bae3c92 does not haveavailable addresses"}

I followed the docs here and here; I have available IPs in the subnets, and the parameters WARM_IP_TARGET and AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG are set properly.

When I have /16 subnets it seems to work fine, but when I have primary node subnets with /24 and secondary CIDR subnets with /18, I get this issue.
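
(That InvalidParameterCombination error usually means the subnet referenced by the ENIConfig is in a different availability zone from the node; a quick check, with placeholder IDs:)

# Sketch: verify the ENIConfig subnet's AZ matches the instance's AZ.
INSTANCE_ID=i-0123456789abcdef0      # placeholder instance ID
SUBNET_ID=subnet-0123456789abcdef0   # placeholder; subnet named in the ENIConfig
aws ec2 describe-instances --instance-ids "$INSTANCE_ID" \
  --query 'Reservations[0].Instances[0].Placement.AvailabilityZone' --output text
aws ec2 describe-subnets --subnet-ids "$SUBNET_ID" \
  --query 'Subnets[0].AvailabilityZone' --output text
# The two values must match, or ENI attachment fails with
# "not in the same availability zone".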

@jdn5126
Contributor

jdn5126 commented May 12, 2023

Closing as fixed by #2354. This will ship in the next VPC CNI release.

@jdn5126 jdn5126 closed this as completed May 12, 2023
@github-actions

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.
