ImagePullBackOff: Failed to pull image "mcr.microsoft.com/k8s/aad-pod-identity/nmi:1.5.3" #1513

Closed
sanjeebsarangi opened this issue Mar 16, 2020 · 10 comments
Labels: Feedback (General feedback), stale (Stale issue), upstream

Comments

@sanjeebsarangi

What happened:
ErrImagePull error for the NMI pod after a node reboot.

What you expected to happen:
All pods should come back to the Running state after rebooting one or more AKS nodes.

How to reproduce it (as minimally and precisely as possible):
Rebooting an AKS node.

Anything else we need to know?:
We have a 7-node cluster and we rebooted one node from the Azure console. The NMI pod is not coming back up, with the errors below. This also happened with another pod pulling its image from an ACR in our own subscription.

```
Events:
  Type     Reason                  Age                     From                                          Message
  ----     ------                  ----                    ----                                          -------
  Warning  FailedCreatePodSandBox  45m (x71 over 61m)      kubelet, aks-fastcompute-25023122-vmss000002  Failed create pod sandbox: rpc error: code = Unknown desc = failed pulling image "mcr.microsoft.com/k8s/core/pause:1.2.0": Error response from daemon: Get https://mcr.microsoft.com/v2/: dial tcp: lookup mcr.microsoft.com on 168.63.129.16:53: no such host
  Normal   SandboxChanged          39m                     kubelet, aks-fastcompute-25023122-vmss000002  Pod sandbox changed, it will be killed and re-created.
  Warning  Failed                  38m (x3 over 39m)       kubelet, aks-fastcompute-25023122-vmss000002  Failed to pull image "mcr.microsoft.com/k8s/aad-pod-identity/nmi:1.5.3": rpc error: code = Unknown desc = Error response from daemon: Get https://mcr.microsoft.com/v2/: dial tcp: lookup mcr.microsoft.com on 168.63.129.16:53: no such host
  Warning  Failed                  38m (x3 over 39m)       kubelet, aks-fastcompute-25023122-vmss000002  Error: ErrImagePull
  Warning  Failed                  37m (x7 over 39m)       kubelet, aks-fastcompute-25023122-vmss000002  Error: ImagePullBackOff
  Normal   Pulling                 37m (x4 over 39m)       kubelet, aks-fastcompute-25023122-vmss000002  Pulling image "mcr.microsoft.com/k8s/aad-pod-identity/nmi:1.5.3"
  Normal   BackOff                 3m52s (x157 over 39m)   kubelet, aks-fastcompute-25023122-vmss000002  Back-off pulling image "mcr.microsoft.com/k8s/aad-pod-identity/nmi:1.5.3"
```
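
For anyone debugging a similar failure: the pull happens on the node, so the lookup goes through the node's resolver (Azure DNS at 168.63.129.16), not CoreDNS. A minimal check, not part of the original report (the resource group and VMSS names below are placeholders derived from the node name in the events), is to run the lookup directly on the affected instance:

```bash
# Placeholders: substitute your cluster's node resource group and VMSS name.
# Runs a DNS lookup on the affected VMSS instance via the Azure run-command API.
az vmss run-command invoke \
  --resource-group MC_myResourceGroup_myCluster_myRegion \
  --name aks-fastcompute-25023122-vmss \
  --instance-id 2 \
  --command-id RunShellScript \
  --scripts "cat /etc/resolv.conf; nslookup mcr.microsoft.com 168.63.129.16"
```

If the lookup fails from the node while it succeeds elsewhere, the problem is the node's path to Azure DNS rather than anything inside the cluster.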

Environment:

  • Kubernetes version: 1.14.8
  • Size of cluster (how many worker nodes are in the cluster?): 7
  • General description of workloads in the cluster: machine learning
  • Others:
@jnoller (Contributor) commented Mar 16, 2020

Hi @sanjeebsarangi - thanks for following up. As noted in email, this is the same issue as #1373: the ImagePullBackOff is triggered by the IOPS load of container image pulls during reboots and node failovers.
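
A rough way to confirm the IOPS hypothesis on an affected node (a sketch, assuming SSH or run-command access; the node name is taken from the events above) is to watch OS-disk utilization while the images are being pulled:

```bash
# On the node: watch the OS disk while the kubelet is pulling images.
# iostat is part of the sysstat package; sustained %util near 100 and
# large await values suggest the disk is saturated or being throttled.
iostat -dxm 5 sda

# From a workstation: check whether the kubelet is reporting disk pressure.
kubectl describe node aks-fastcompute-25023122-vmss000002 | grep -A6 "Conditions:"
```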

@jnoller jnoller added iops and removed triage labels Mar 16, 2020
@cdunford commented Jul 3, 2020

@jnoller is there a plan to address this in some way? It seems like a severe limitation if we cannot expect nodes to restart successfully.

@ghost ghost added the action-required label Jul 28, 2020
@ghost commented Aug 2, 2020

Action required from @Azure/aks-pm

@ghost ghost added the Needs Attention 👋 label Aug 2, 2020
@ryanmcafee

Any update on this? I have clusters provisioned with Standard_D4as_v4 nodes and 256 GB premium SSDs, and I am seeing this issue when provisioning aad-pod-identity with Terraform. @jnoller, what is the expected IOPS load of a container image pull?

@ryanmcafee

Could this be related to using remote network disks for the node OS disks? Would Microsoft recommend using VMs in the Kubernetes node pool that support local premium disks, rather than relying on remote storage?
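
If local disks do turn out to help, one option worth testing (resource names below are placeholders; this assumes a VM size with enough local/cache storage and a reasonably recent Azure CLI) is a node pool backed by an ephemeral OS disk, which keeps the OS disk and image layers on the VM's local storage instead of a remote managed disk:

```bash
# Placeholder resource names; --node-osdisk-type Ephemeral places the OS disk
# on the VM's local storage, avoiding remote-disk IOPS limits during the
# burst of image pulls that follows a reboot or failover.
az aks nodepool add \
  --resource-group myResourceGroup \
  --cluster-name myCluster \
  --name ephemeralnp \
  --node-count 3 \
  --node-vm-size Standard_D4ds_v4 \
  --node-osdisk-type Ephemeral
```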

@palma21 (Member) commented Aug 6, 2020

The above is a DNS issue, not an IOPS issue.

What is your container runtime version? Did you happen to open a support ticket for this?

CC @aramase @cpuguy83 @ritazh

@palma21 palma21 added the Feedback and upstream labels and removed the Needs Attention 👋, action-required, and iops labels Aug 6, 2020
@ryanmcafee
Copy link

@palma21 I think it's related, as @jnoller noted in #1373 that disk I/O saturation and throttling cause the cluster DNS to fail. And yes, I created a ticket for this issue: 120080724005711. I'm testing out Velero before rolling out this change to our additional Azure subscriptions/AKS clusters.

@palma21 (Member) commented Aug 7, 2020

Your case might be, since I don't see your error, but the OP's case is not: the daemon is working fine and is not throttled, it simply cannot resolve the registry hostname. That lookup doesn't even use cluster DNS; 168.63.129.16 is Azure DNS.
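
To illustrate the distinction (a sketch; the busybox image tag is only an example and assumes the image is cached or pullable): pods resolve names through CoreDNS, while the container runtime on the node resolves registry hostnames through the node's own /etc/resolv.conf, which by default points at Azure DNS:

```bash
# Resolution from inside a pod goes through cluster DNS (CoreDNS):
kubectl run dns-check --rm -it --restart=Never --image=busybox:1.28 -- \
  nslookup mcr.microsoft.com

# Resolution for image pulls uses the node's resolver; on the node itself:
cat /etc/resolv.conf
nslookup mcr.microsoft.com 168.63.129.16
```

If the second check fails while the first succeeds, the failure is on the node-to-Azure-DNS path, and no amount of CoreDNS tuning will change the pull behavior.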

@ghost ghost added the stale label Oct 7, 2020
@ghost commented Oct 7, 2020

This issue has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs within 15 days of this comment.

@ghost ghost closed this as completed Oct 22, 2020
@ghost commented Oct 22, 2020

This issue will now be closed because it hasn't had any activity for 15 days after being marked stale. @sanjeebsarangi, feel free to comment again within the next 7 days to reopen it, or open a new issue after that time if you still have a question, issue, or suggestion.

@ghost ghost locked as resolved and limited conversation to collaborators Nov 21, 2020