ImagePullBackOff: Failed to pull image "mcr.microsoft.com/k8s/aad-pod-identity/nmi:1.5.3" #1513

Closed
sanjeebsarangi opened this issue Mar 16, 2020 · 10 comments
Labels: Feedback (General feedback), stale (Stale issue), upstream

Comments

@sanjeebsarangi

What happened:
ErrImagePull error for the NMI pod after a node reboot.

What you expected to happen:
All pods should come back to the Running state after rebooting one or more AKS nodes.

How to reproduce it (as minimally and precisely as possible):
Rebooting an AKS node.

Anything else we need to know?:
We have a 7-node cluster and we rebooted one node from the Azure console. The NMI pod is not coming back up, with the errors below. This also happened with another pod pulling its image from an ACR in our own subscription.

```
Events:
  Type     Reason                  Age                     From                                          Message
  ----     ------                  ----                    ----                                          -------
  Warning  FailedCreatePodSandBox  45m (x71 over 61m)      kubelet, aks-fastcompute-25023122-vmss000002  Failed create pod sandbox: rpc error: code = Unknown desc = failed pulling image "mcr.microsoft.com/k8s/core/pause:1.2.0": Error response from daemon: Get https://mcr.microsoft.com/v2/: dial tcp: lookup mcr.microsoft.com on 168.63.129.16:53: no such host
  Normal   SandboxChanged          39m                     kubelet, aks-fastcompute-25023122-vmss000002  Pod sandbox changed, it will be killed and re-created.
  Warning  Failed                  38m (x3 over 39m)       kubelet, aks-fastcompute-25023122-vmss000002  Failed to pull image "mcr.microsoft.com/k8s/aad-pod-identity/nmi:1.5.3": rpc error: code = Unknown desc = Error response from daemon: Get https://mcr.microsoft.com/v2/: dial tcp: lookup mcr.microsoft.com on 168.63.129.16:53: no such host
  Warning  Failed                  38m (x3 over 39m)       kubelet, aks-fastcompute-25023122-vmss000002  Error: ErrImagePull
  Warning  Failed                  37m (x7 over 39m)       kubelet, aks-fastcompute-25023122-vmss000002  Error: ImagePullBackOff
  Normal   Pulling                 37m (x4 over 39m)       kubelet, aks-fastcompute-25023122-vmss000002  Pulling image "mcr.microsoft.com/k8s/aad-pod-identity/nmi:1.5.3"
  Normal   BackOff                 3m52s (x157 over 39m)   kubelet, aks-fastcompute-25023122-vmss000002  Back-off pulling image "mcr.microsoft.com/k8s/aad-pod-identity/nmi:1.5.3"
```
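
For anyone debugging a similar failure: the pull happens on the node, so the lookup goes through the node's resolver (Azure DNS at 168.63.129.16), not CoreDNS. A minimal check, not part of the original report (the resource group and VMSS names below are placeholders derived from the node name in the events), is to run the lookup directly on the affected instance:

```bash
# Placeholders: substitute your cluster's node resource group and VMSS name.
# Runs a DNS lookup on the affected VMSS instance via the Azure run-command API.
az vmss run-command invoke \
  --resource-group MC_myResourceGroup_myCluster_myRegion \
  --name aks-fastcompute-25023122-vmss \
  --instance-id 2 \
  --command-id RunShellScript \
  --scripts "cat /etc/resolv.conf; nslookup mcr.microsoft.com 168.63.129.16"
```

If the lookup fails from the node while it succeeds elsewhere, the problem is the node's path to Azure DNS rather than anything inside the cluster.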

Environment:

  • Kubernetes version: 1.14.8
  • Size of cluster (how many worker nodes are in the cluster?): 7
  • General description of workloads in the cluster: machine learning
  • Others:
@jnoller (Contributor) commented Mar 16, 2020

Hi @sanjeebsarangi - thanks for following up. As noted in email, this is the same issue as #1373: the ImagePullBackOff is triggered by the IOPS load of container image pulls during reboots and node failovers.
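
A rough way to confirm the IOPS hypothesis on an affected node (a sketch, assuming SSH or run-command access; the node name is taken from the events above) is to watch OS-disk utilization while the images are being pulled:

```bash
# On the node: watch the OS disk while the kubelet is pulling images.
# iostat is part of the sysstat package; sustained %util near 100 and
# large await values suggest the disk is saturated or being throttled.
iostat -dxm 5 sda

# From a workstation: check whether the kubelet is reporting disk pressure.
kubectl describe node aks-fastcompute-25023122-vmss000002 | grep -A6 "Conditions:"
```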

@jnoller jnoller added iops and removed triage labels Mar 16, 2020
@cdunford commented Jul 3, 2020

@jnoller is there a plan to address this in some way? It seems like a severe limitation if we cannot expect nodes to restart successfully.

@ghost ghost added the action-required label Jul 28, 2020
@ghost commented Aug 2, 2020

Action required from @Azure/aks-pm

@ghost ghost added the Needs Attention 👋 label Aug 2, 2020
@ryanmcafee

Any update on this? I have clusters provisioned with Standard_D4as_v4 nodes and 256 GB premium SSDs, and I am seeing this issue when provisioning aad-pod-identity with Terraform. @jnoller, what is the expected IOPS load of a container image pull?

@ryanmcafee

Could this be related to using remote network disks for the node OS disks? Would Microsoft recommend using VMs in the Kubernetes node pool that support local premium disks, rather than relying on remote storage?
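
If local disks do turn out to help, one option worth testing (resource names below are placeholders; this assumes a VM size with enough local/cache storage and a reasonably recent Azure CLI) is a node pool backed by an ephemeral OS disk, which keeps the OS disk and image layers on the VM's local storage instead of a remote managed disk:

```bash
# Placeholder resource names; --node-osdisk-type Ephemeral places the OS disk
# on the VM's local storage, avoiding remote-disk IOPS limits during the
# burst of image pulls that follows a reboot or failover.
az aks nodepool add \
  --resource-group myResourceGroup \
  --cluster-name myCluster \
  --name ephemeralnp \
  --node-count 3 \
  --node-vm-size Standard_D4ds_v4 \
  --node-osdisk-type Ephemeral
```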

@palma21 (Member) commented Aug 6, 2020

The above is a DNS issue, not an IOPS issue.

What is your container runtime version? Did you happen to open a support ticket for this?

CC @aramase @cpuguy83 @ritazh

@palma21 palma21 added the Feedback and upstream labels and removed the Needs Attention 👋, action-required, and iops labels Aug 6, 2020
@ryanmcafee
Copy link

@palma21 I think it's related, as @jnoller noted in #1373 that disk I/O saturation and throttling cause the cluster DNS to fail. And yes, I created a ticket for this issue: 120080724005711. I'm testing out Velero before rolling out this change to our additional Azure subscriptions/AKS clusters.

@palma21 (Member) commented Aug 7, 2020

Your case might be, since I don't see your error, but the OP's case is not: the daemon is working fine and is not throttled, it simply cannot resolve the registry hostname. That lookup doesn't even use cluster DNS; 168.63.129.16 is Azure DNS.
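
To illustrate the distinction (a sketch; the busybox image tag is only an example and assumes the image is cached or pullable): pods resolve names through CoreDNS, while the container runtime on the node resolves registry hostnames through the node's own /etc/resolv.conf, which by default points at Azure DNS:

```bash
# Resolution from inside a pod goes through cluster DNS (CoreDNS):
kubectl run dns-check --rm -it --restart=Never --image=busybox:1.28 -- \
  nslookup mcr.microsoft.com

# Resolution for image pulls uses the node's resolver; on the node itself:
cat /etc/resolv.conf
nslookup mcr.microsoft.com 168.63.129.16
```

If the second check fails while the first succeeds, the failure is on the node-to-Azure-DNS path, and no amount of CoreDNS tuning will change the pull behavior.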

@ghost ghost added the stale label Oct 7, 2020
@ghost commented Oct 7, 2020

This issue has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs within 15 days of this comment.

@ghost ghost closed this as completed Oct 22, 2020
@ghost commented Oct 22, 2020

This issue will now be closed because it hasn't had any activity for 15 days after being marked stale. @sanjeebsarangi, feel free to comment again within the next 7 days to reopen it, or open a new issue after that time if you still have a question, issue, or suggestion.

@ghost ghost locked as resolved and limited conversation to collaborators Nov 21, 2020