Kubelet timeout generating ImagePullBackOff error #3084

vinibodruch · 2022-11-03T19:38:19Z

TL;DR

Where is the kubelet config file (as described in https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/) on Rancher 2.6.9 with RKE1?
Can I manage it? Does this file exist?
I didn't find it in /var/lib/kubelet:

# pwd
/var/lib/kubelet
# ls -lha
total 16K
drwxr-xr-x   9 root root  185 Sep  5 13:20 .
drwxr-xr-x. 42 root root 4.0K Sep 22 15:50 ..
-rw-------   1 root root   62 Sep  5 13:20 cpu_manager_state
drwxr-xr-x   2 root root   45 Nov  1 11:27 device-plugins
-rw-------   1 root root   61 Sep  5 13:20 memory_manager_state
drwxr-xr-x   2 root root   44 Sep  5 13:20 pki
drwxr-x---   2 root root    6 Sep  5 13:20 plugins
drwxr-x---   2 root root    6 Sep  5 13:20 plugins_registry
drwxr-x---   2 root root   26 Nov  1 11:27 pod-resources
drwxr-x---  11 root root 4.0K Oct 24 23:57 pods
drwxr-xr-x   2 root root    6 Sep  5 13:20 volumeplugins

Explain

Recently we upgraded Kubernetes to v1.24.4-rancher1-1 and Rancher to 2.6.9. Everything worked fine at first, but we have since noticed a new behavior: if an image is too big or takes more than 2 minutes to download, Kubernetes raises an ErrImagePull.
To work around this error, I need to log in to the cluster and run docker pull <image> manually.

Error: ImagePullBackOff

~ ❯ kubectl get pods -n mobile test-imagepullback-c7fc59d86-gwtc7
NAME                                 READY   STATUS              RESTARTS   AGE
test-imagepullback-c7fc59d86-gwtc7   0/1     ContainerCreating   0          2m

~ ❯ kubectl get pods -n mobile test-imagepullback-c7fc59d86-gwtc7
NAME                                 READY   STATUS         RESTARTS   AGE
test-imagepullback-c7fc59d86-gwtc7   0/1     ErrImagePull   0          2m1s

~ ❯ kubectl get pods -n mobile test-imagepullback-c7fc59d86-gwtc7
NAME                                 READY   STATUS             RESTARTS   AGE
test-imagepullback-c7fc59d86-gwtc7   0/1     ImagePullBackOff   0          2m12s

Searching for the problem, we found that the error seems to be caused by a timeout in the kubelet's runtime requests (2 minutes, according to the docs at https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/), which can be raised with the flag --runtime-request-timeout duration. However, changing cluster.yml with the parameters below has no effect:

[...]
    kubelet:
      extra_args:
        runtime-request-timeout: 10m
      fail_swap_on: false
[...]

The running process shows that the parameter is passed to the kubelet:

# ps -ef | grep runtime-request-timeout
root      7286  7267  0 Nov01 ?        00:00:00 /bin/bash /opt/rke-tools/entrypoint.sh kubelet {...} --runtime-request-timeout=10m {...}
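
To confirm on a node which value the running kubelet actually received, the flag can be pulled out of the process command line. This is only a minimal sketch: flag_value is a hypothetical helper name, and it handles just the --name=value form seen in the ps output above.

```shell
# Hypothetical helper: print the value of a --name=value flag found in a
# command line read from stdin (first match only).
flag_value() {
    tr ' ' '\n' | sed -n "s/^--$1=//p" | head -n 1
}

# Example with a shortened command line in the same shape as the ps output.
echo "kubelet --fail-swap-on=false --runtime-request-timeout=10m --v=2" \
    | flag_value runtime-request-timeout
```

On a node, the same function could be fed from ps -ef instead of echo.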

According to the official page, this flag is deprecated, which might explain this behavior; to change the value I need to set a parameter named runtimeRequestTimeout inside a config file.
So I have some questions:

Where do I change it?
Does this file exist in Rancher, or do I need to create it?
Is there a way to work around this with another parameter in extra_args?
Why is this happening now? Is it because of the deprecation of dockershim?

I read these docs too, but with no success:

Configs and current versions

K8s version:

#  kubectl version --short
Client Version: v1.25.0
Kustomize Version: v4.5.7
Server Version: v1.24.4

RKE version:

# rke --version
rke version v1.3.15

Docker version: (docker version,docker info preferred)

# docker --version
Docker version 20.10.7, build f0df350

# docker info
Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Build with BuildKit (Docker Inc., v0.5.1-docker)
  scan: Docker Scan (Docker Inc., v0.12.0)

Server:
 Containers: 20
  Running: 20
  Paused: 0
  Stopped: 0
 Images: 8
 Server Version: 20.10.21
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 1c90a442489720eec95342e1789ee8a5e1b9536f
 runc version: v1.1.4-0-g5fd4c4d
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 3.10.0-1160.76.1.el7.x86_64
 Operating System: CentOS Linux 7 (Core)
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 11.58GiB
 Name: anchieta
 ID: OZNJ:RKES:NTOH:G37X:4NYO:IJ3U:SKHO:FFG3:RJ7B:GCCJ:XOZN:NRHE
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Operating system and kernel: (cat /etc/os-release, uname -r preferred)

# cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

# uname -r
3.10.0-1160.76.1.el7.x86_64

Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO)
VMware

cluster.yml file:

nodes:
- address: host1
  port: "22"
  internal_address: ""
  role:
  - controlplane
  - etcd
  hostname_override: ""
  user: rancher
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: ~/.ssh/id_rsa
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
  taints: []

[...]

services:
  etcd:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    win_extra_args: {}
    win_extra_binds: []
    win_extra_env: []
    external_urls: []
    ca_cert: ""
    cert: ""
    key: ""
    path: ""
    uid: 0
    gid: 0
    snapshot: null
    retention: ""
    creation: ""
    backup_config: null
  kube-api:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    win_extra_args: {}
    win_extra_binds: []
    win_extra_env: []
    service_cluster_ip_range: 10.43.0.0/16
    service_node_port_range: ""
    pod_security_policy: false
    always_pull_images: false
    secrets_encryption_config: null
    audit_log: null
    admission_configuration: null
    event_rate_limit: null
  kube-controller:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    win_extra_args: {}
    win_extra_binds: []
    win_extra_env: []
    cluster_cidr: 10.42.0.0/16
    service_cluster_ip_range: 10.43.0.0/16
  scheduler:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    win_extra_args: {}
    win_extra_binds: []
    win_extra_env: []
  kubelet:
    image: ""
    extra_args: 
      runtime-request-timeout: 30m
    extra_binds: []
    extra_env: []
    win_extra_args: {}
    win_extra_binds: []
    win_extra_env: []
    cluster_domain: cluster.local
    infra_container_image: ""
    cluster_dns_server: 10.43.0.10
    fail_swap_on: false
    generate_serving_certificate: false
  kubeproxy:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    win_extra_args: {}
    win_extra_binds: []
    win_extra_env: []
network:
  plugin: canal
  options: {}
  mtu: 0
  node_selector: {}
  update_strategy: null
  tolerations: []
authentication:
  strategy: x509
  sans: []
  webhook: null
addons: ""
addons_include: []
# Versions defined in kubernetes_version
#system_images:
#  etcd: rancher/mirrored-coreos-etcd:v3.5.0
#  alpine: rancher/rke-tools:v0.1.78
#  nginx_proxy: rancher/rke-tools:v0.1.78
#  cert_downloader: rancher/rke-tools:v0.1.78
#  kubernetes_services_sidecar: rancher/rke-tools:v0.1.78
#  kubedns: rancher/mirrored-k8s-dns-kube-dns:1.17.4
#  dnsmasq: rancher/mirrored-k8s-dns-dnsmasq-nanny:1.17.4
#  kubedns_sidecar: rancher/mirrored-k8s-dns-sidecar:1.17.4
#  kubedns_autoscaler: rancher/mirrored-cluster-proportional-autoscaler:1.8.3
#  coredns: rancher/mirrored-coredns-coredns:1.8.6
#  coredns_autoscaler: rancher/mirrored-cluster-proportional-autoscaler:1.8.5
#  nodelocal: rancher/mirrored-k8s-dns-node-cache:1.21.1
#  #kubernetes: rancher/hyperkube:v1.24.4-rancher1
#  flannel: rancher/mirrored-coreos-flannel:v0.15.1
#  flannel_cni: rancher/flannel-cni:v0.3.0-rancher6
#  calico_node: rancher/mirrored-calico-node:v3.21.1
#  calico_cni: rancher/mirrored-calico-cni:v3.21.1
#  calico_controllers: rancher/mirrored-calico-kube-controllers:v3.21.1
#  calico_ctl: rancher/mirrored-calico-ctl:v3.21.1
#  calico_flexvol: rancher/mirrored-calico-pod2daemon-flexvol:v3.21.1
#  canal_node: rancher/mirrored-calico-node:v3.21.1
#  canal_cni: rancher/mirrored-calico-cni:v3.21.1
#  canal_controllers: rancher/mirrored-calico-kube-controllers:v3.21.1
#  canal_flannel: rancher/mirrored-coreos-flannel:v0.15.1
#  canal_flexvol: rancher/mirrored-calico-pod2daemon-flexvol:v3.21.1
#  weave_node: weaveworks/weave-kube:2.8.1
#  weave_cni: weaveworks/weave-npc:2.8.1
#  pod_infra_container: rancher/mirrored-pause:3.5
#  ingress: rancher/nginx-ingress-controller:nginx-1.1.0-rancher1
#  ingress_backend: rancher/mirrored-nginx-ingress-controller-defaultbackend:1.5-rancher1
#  ingress_webhook: rancher/mirrored-ingress-nginx-kube-webhook-certgen:v1.1.1
#  metrics_server: rancher/mirrored-metrics-server:v0.5.1
#  windows_pod_infra_container: rancher/kubelet-pause:v0.1.6
#  aci_cni_deploy_container: noiro/cnideploy:5.1.1.0.1ae238a
#  aci_host_container: noiro/aci-containers-host:5.1.1.0.1ae238a
#  aci_opflex_container: noiro/opflex:5.1.1.0.1ae238a
#  aci_mcast_container: noiro/opflex:5.1.1.0.1ae238a
#  aci_ovs_container: noiro/openvswitch:5.1.1.0.1ae238a
#  aci_controller_container: noiro/aci-containers-controller:5.1.1.0.1ae238a
#  aci_gbp_server_container: noiro/gbp-server:5.1.1.0.1ae238a
#  aci_opflex_server_container: noiro/opflex-server:5.1.1.0.1ae238a
ssh_key_path: ~/.ssh/id_rsa
ssh_cert_path: ""
ssh_agent_auth: false
authorization:
  mode: rbac
  options: {}
ignore_docker_version: null
enable_cri_dockerd: null
kubernetes_version: "v1.24.4-rancher1-1"
private_registries: []
ingress:
  provider: ""
  options: {}
  node_selector: {}
  extra_args: {}
  dns_policy: ""
  extra_envs: []
  extra_volumes: []
  extra_volume_mounts: []
  update_strategy: null
  http_port: 0
  https_port: 0
  network_mode: ""
  tolerations: []
  default_backend: null
  default_http_backend_priority_class_name: ""
  nginx_ingress_controller_priority_class_name: ""
  default_ingress_class: null
cluster_name: ""
cloud_provider:
  name: ""
prefix_path: ""
win_prefix_path: ""
addon_job_timeout: 0
bastion_host:
  address: ""
  port: ""
  user: ""
  ssh_key: ""
  ssh_key_path: ""
  ssh_cert: ""
  ssh_cert_path: ""
  ignore_proxy_env_vars: false
monitoring:
  provider: ""
  options: {}
  node_selector: {}
  update_strategy: null
  replicas: null
  tolerations: []
  metrics_server_priority_class_name: ""
restore:
  restore: false
  snapshot_name: ""
rotate_encryption_key: false
dns: null

I would be grateful for any help that lets me and others solve this annoying issue.

gmanera · 2022-11-03T19:41:37Z

same problem here.

jiaqiluo · 2022-11-04T04:57:24Z

Where is the kubelet config file on Rancher 2.6.9 - RKE1, like this kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file
Can I manage it? Does this file exist?

Rancher/RKE does not use a kubelet config file to configure the kubelet, which sadly means you will not find one anywhere. But that does not mean you cannot use one; you are actually very close to the final solution:

You need to set both extra_args and extra_binds to make it work.
In your cluster.yml it will look like the following:

services:
  kubelet:
    extra_args:
      config: path-to-the-config-file-in-the-container
    extra_binds:
      - "path-to-file-on-host:path-to-the-config-file-in-the-container"

And of course, you need to create or place such a config file on the control plane node beforehand.
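
As a sketch of that last step, the file could be created on the node like this before applying cluster.yml. The host path /opt/kubelet/kubelet-config.yaml and the 10m value are assumptions for illustration, write_kubelet_config is a hypothetical helper name, and the only field set is runtimeRequestTimeout, the one discussed in this issue.

```shell
# Hypothetical helper: write a minimal KubeletConfiguration to the given path.
# $1 = path on the host (assumption: the same path is used in extra_binds)
# $2 = value for runtimeRequestTimeout, e.g. 10m
write_kubelet_config() {
    mkdir -p "$(dirname "$1")" || return 1
    cat > "$1" <<EOF
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
runtimeRequestTimeout: "$2"
EOF
}

# Example (run as root on the node before applying cluster.yml):
#   write_kubelet_config /opt/kubelet/kubelet-config.yaml 10m
```

The chosen path would then appear on the host side of the extra_binds entry, and the container-side path in extra_args.config.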

I hope this is helpful.

jiaqiluo · 2022-11-04T05:00:53Z

And a caveat: AFAIK, the kubelet process does not restart automatically when the config file changes, which means you need to restart the kubelet container after changing the "external" config file.
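
A minimal sketch of that restart step, assuming the RKE-managed container is named kubelet (as the docker inspect kubelet calls elsewhere in this thread suggest) and that comparing checksums is an acceptable way to detect a change. The helper name and the state-file argument are hypothetical.

```shell
# Hypothetical helper: restart the kubelet container only when the config
# file content changed since the last run.
# $1 = kubelet config file, $2 = file used to remember the last checksum
restart_kubelet_if_changed() {
    new_sum=$(cksum < "$1") || return 1
    old_sum=$(cat "$2" 2>/dev/null || true)
    if [ "$new_sum" != "$old_sum" ]; then
        # Assumption: the container is named "kubelet".
        docker restart kubelet || return 1
        printf '%s\n' "$new_sum" > "$2"
    fi
}

# Example:
#   restart_kubelet_if_changed /opt/kubelet/kubelet-config.yaml /var/tmp/kubelet-config.sum
```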

likku123 · 2022-11-04T17:09:26Z

I am still experiencing the timeout issue.

kubelet:
  extra_args:
    config: /opt/kubelet_timeout_config.yaml
  extra_binds:
    - '/opt/kubelet_timeout_config.yaml:/opt/kubelet_timeout_config.yaml'

kubelet_timeout_config.yaml

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
serializeImagePulls: false
runtimeRequestTimeout: "240m"

jiaqiluo · 2022-11-04T17:42:05Z

@likku123 can you run the following checks on the kubelet container on the control plane node:

docker logs to check whether there is any error message.
docker exec into the container to see whether the config file exists and has the proper content.
docker inspect to check whether --config is set.

If all of the above look right, RKE has configured the kubelet properly, and I would then suspect an upstream issue or something wrong outside of RKE.
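
The three checks above can be sketched as one helper. It assumes the container to inspect is the RKE-managed kubelet container (the one that receives --config); check_container_config is a hypothetical name and the config path is whatever was used in extra_binds.

```shell
# Hypothetical helper bundling the three checks for one container.
# $1 = container name (assumption: "kubelet")
# $2 = config file path inside the container
check_container_config() {
    echo "== last log lines =="
    docker logs --tail 50 "$1" 2>&1
    echo "== config file inside the container =="
    docker exec "$1" cat "$2"
    echo "== arguments containing --config =="
    # Split the JSON argument list on commas and keep the --config entry;
    # grep's exit status makes the function fail when the flag is missing.
    docker inspect --format '{{json .Args}}' "$1" | tr ',' '\n' | grep -e '--config'
}

# Example:
#   check_container_config kubelet /opt/kubelet_timeout_config.yaml
```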

vinibodruch · 2022-11-05T02:44:11Z

Thanks for your response, Jiaqi Luo. I spent this afternoon trying to solve this issue, but unfortunately couldn't find a way.

The kubelet-config.yml file I used on each server:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
clientCAFile: "/etc/kubernetes/ssl/kube-ca.pem"
runtimeRequestTimeout: 45m0s
tlsCipherSuites: ["TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256", "TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384", "TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305", "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256", "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384", "TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305"]
failSwapOn: False
volumePluginDir: "/var/lib/kubelet/volumeplugin"
clusterDomain: "cluster.local"

The RKE config:

    kubelet:
      extra_args:
        config: /var/lib/kubelet/kubelet-config.yml
      extra_binds:
        - >-
          /var/lib/kubelet/kubelet-config.yml:/var/lib/kubelet/kubelet-config.yml

And finally, the process running on one server with --config, as an example:

# ps -ef | grep kubelet
root     22181 22161  0 16:57 ?        00:00:00 /bin/bash /opt/rke-tools/entrypoint.sh kubelet --cgroups-per-qos=True --make-iptables-util-chains=true --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305 --cloud-provider= --fail-swap-on=false --kubeconfig=/etc/kubernetes/ssl/kubecfg-kube-node.yaml --container-runtime=remote --event-qps=0 --address=0.0.0.0 --config=/var/lib/kubelet/kubelet-config.yml --client-ca-file=/etc/kubernetes/ssl/kube-ca.pem --cluster-dns=10.43.0.10 --cluster-domain=cluster.local --root-dir=/var/lib/kubelet --authentication-token-webhook=true --hostname-override=saquarema --container-runtime-endpoint=unix:///var/run/cri-dockerd.sock --anonymous-auth=false --v=2 --authorization-mode=Webhook --pod-infra-container-image=registry.hub.docker.com/rancher/mirrored-pause:3.6 --read-only-port=0 --resolv-conf=/etc/resolv.conf --streaming-connection-idle-timeout=30m --volume-plugin-dir=/var/lib/kubelet/volumeplugins
root     23588 22181  3 16:58 ?        00:00:17 kubelet --cgroups-per-qos=True --make-iptables-util-chains=true --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305 --cloud-provider= --fail-swap-on=false --kubeconfig=/etc/kubernetes/ssl/kubecfg-kube-node.yaml --container-runtime=remote --event-qps=0 --address=0.0.0.0 --config=/var/lib/kubelet/kubelet-config.yml --client-ca-file=/etc/kubernetes/ssl/kube-ca.pem --cluster-dns=10.43.0.10 --cluster-domain=cluster.local --root-dir=/var/lib/kubelet --authentication-token-webhook=true --hostname-override=saquarema --container-runtime-endpoint=unix:///var/run/cri-dockerd.sock --anonymous-auth=false --v=2 --authorization-mode=Webhook --pod-infra-container-image=registry.hub.docker.com/rancher/mirrored-pause:3.6 --read-only-port=0 --resolv-conf=/etc/resolv.conf --streaming-connection-idle-timeout=30m --volume-plugin-dir=/var/lib/kubelet/volumeplugins --cgroup-driver=cgroupfs
root     27494 27474  0 Nov03 ?        00:00:01 /csi-node-driver-registrar --v=2 --csi-address=/csi/csi.sock --kubelet-registration-path=/var/lib/kubelet/plugins/driver.longhorn.io/csi.sock

I found a request that should show the current kubelet configuration:

~ ❯ kubectl proxy --port=8001
~ ❯ NODE_NAME="host1"; curl -sSL "http://localhost:8001/api/v1/nodes/${NODE_NAME}/proxy/configz" | jq '.kubeletconfig|.kind="KubeletConfiguration"|.apiVersion="kubelet.config.k8s.io/v1beta1"' > kubelet_config_${NODE_NAME}

The content returned:

{
    "kubeletconfig": {
        "enableServer": true,
        "syncFrequency": "1m0s",
        "fileCheckFrequency": "20s",
        "httpCheckFrequency": "20s",
        "address": "0.0.0.0",
        "port": 10250,
        "tlsCertFile": "/var/lib/kubelet/pki/kubelet.crt",
        "tlsPrivateKeyFile": "/var/lib/kubelet/pki/kubelet.key",
        "tlsCipherSuites": [
            "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256",
            "TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384",
            "TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305",
            "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256",
            "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384",
            "TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305"
        ],
        "authentication": {
            "x509": {
                "clientCAFile": "/etc/kubernetes/ssl/kube-ca.pem"
            },
            "webhook": {
                "enabled": true,
                "cacheTTL": "2m0s"
            },
            "anonymous": {
                "enabled": false
            }
        },
        "authorization": {
            "mode": "Webhook",
            "webhook": {
                "cacheAuthorizedTTL": "5m0s",
                "cacheUnauthorizedTTL": "30s"
            }
        },
        "registryPullQPS": 5,
        "registryBurst": 10,
        "eventRecordQPS": 0,
        "eventBurst": 10,
        "enableDebuggingHandlers": true,
        "healthzPort": 10248,
        "healthzBindAddress": "127.0.0.1",
        "oomScoreAdj": -999,
        "clusterDomain": "cluster.local",
        "clusterDNS": [
            "10.43.0.10"
        ],
        "streamingConnectionIdleTimeout": "30m0s",
        "nodeStatusUpdateFrequency": "10s",
        "nodeStatusReportFrequency": "5m0s",
        "nodeLeaseDurationSeconds": 40,
        "imageMinimumGCAge": "2m0s",
        "imageGCHighThresholdPercent": 85,
        "imageGCLowThresholdPercent": 80,
        "volumeStatsAggPeriod": "1m0s",
        "cgroupsPerQOS": true,
        "cgroupDriver": "cgroupfs",
        "cpuManagerPolicy": "none",
        "cpuManagerReconcilePeriod": "10s",
        "memoryManagerPolicy": "None",
        "topologyManagerPolicy": "none",
        "topologyManagerScope": "container",
        "runtimeRequestTimeout": "40m0s",
        "hairpinMode": "promiscuous-bridge",
        "maxPods": 110,
        "podPidsLimit": -1,
        "resolvConf": "/etc/resolv.conf",
        "cpuCFSQuota": true,
        "cpuCFSQuotaPeriod": "100ms",
        "nodeStatusMaxImages": 50,
        "maxOpenFiles": 1000000,
        "contentType": "application/vnd.kubernetes.protobuf",
        "kubeAPIQPS": 5,
        "kubeAPIBurst": 10,
        "serializeImagePulls": true,
        "evictionHard": {
            "imagefs.available": "15%",
            "memory.available": "100Mi",
            "nodefs.available": "10%",
            "nodefs.inodesFree": "5%"
        },
        "evictionPressureTransitionPeriod": "5m0s",
        "enableControllerAttachDetach": true,
        "makeIPTablesUtilChains": true,
        "iptablesMasqueradeBit": 14,
        "iptablesDropBit": 15,
        "failSwapOn": false,
        "memorySwap": {},
        "containerLogMaxSize": "10Mi",
        "containerLogMaxFiles": 5,
        "configMapAndSecretChangeDetectionStrategy": "Watch",
        "enforceNodeAllocatable": [
            "pods"
        ],
        "volumePluginDir": "/var/lib/kubelet/volumeplugins",
        "logging": {
            "format": "text",
            "flushFrequency": 5000000000,
            "verbosity": 1,
            "options": {
                "json": {
                    "infoBufferSize": "0"
                }
            }
        },
        "enableSystemLogHandler": true,
        "shutdownGracePeriod": "0s",
        "shutdownGracePeriodCriticalPods": "0s",
        "enableProfilingHandler": true,
        "enableDebugFlagsHandler": true,
        "seccompDefault": false,
        "memoryThrottlingFactor": 0.8,
        "registerWithTaints": [
            {
                "key": "node-role.kubernetes.io/controlplane",
                "value": "true",
                "effect": "NoSchedule"
            }
        ],
        "registerNode": true
    }
}
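
Since the full dump is long, a quick way to pull out only the effective timeout is sketched below. The value the kubelet reports here is what it actually loaded, which helps separate "config not applied" from "config applied but pulls still time out". The helper name is hypothetical, and sed is used only to avoid depending on jq; it assumes the key appears once in the response.

```shell
# Hypothetical helper: read /configz JSON on stdin and print the effective
# runtimeRequestTimeout (first occurrence only).
effective_runtime_timeout() {
    sed -n 's/.*"runtimeRequestTimeout": *"\([^"]*\)".*/\1/p' | head -n 1
}

# Example with a trimmed response in the same shape as the output above.
printf '%s\n' '{"kubeletconfig": {"runtimeRequestTimeout": "40m0s", "maxPods": 110}}' \
    | effective_runtime_timeout
```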

I restarted the kubelet and the server, but the ErrImagePull behavior still persists whenever a pull takes more than 2 minutes...
docker container inspect kubelet does show the --config flag, and docker container logs kubelet is not very helpful, just showing ErrImagePull: rpc error: code = Unknown desc = context deadline exceeded

So, searching further, I found similar issues:

And a pull request about this issue too: kubernetes/minikube#13600

So this is probably a bug! But I found something interesting that I'll try later, related to changing the container runtime: kubernetes/minikube#14789 (comment)

likku123 · 2022-11-05T03:35:22Z

This is definitely an issue with the cri-dockerd version that ships with rke-tools.
Right now the version is:
/opt/rke-tools/bin# ./cri-dockerd --version
cri-dockerd 0.2.4 (4b57f30)
As per this link, kubernetes/minikube#14789 (comment), cri-dockerd 0.2.6 includes the patch that solves the timeout issue.

Any suggestions for deploying cri-dockerd 0.2.6 in my present setup?
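
To see quickly which nodes still run a cri-dockerd older than 0.2.6, the --version line can be compared against that release. This is a sketch under assumptions: has_pull_timeout_fix is a hypothetical helper, the version is taken as the second word of the output shown above, and sort -V (GNU coreutils) is used for version ordering.

```shell
# Hypothetical helper: succeed when a `cri-dockerd --version` line reports
# 0.2.6 or newer.
has_pull_timeout_fix() {
    version=$(printf '%s\n' "$1" | awk '{print $2}')
    # The smaller of the two versions sorts first; if that is 0.2.6,
    # the reported version is at least 0.2.6.
    lowest=$(printf '%s\n%s\n' "$version" "0.2.6" | sort -V | head -n 1)
    [ "$lowest" = "0.2.6" ]
}

if has_pull_timeout_fix "cri-dockerd 0.2.4 (4b57f30)"; then
    echo "fix included"
else
    echo "older than 0.2.6"
fi
```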

vinibodruch · 2022-11-05T05:26:56Z

No idea how to upgrade it...
I have the same version of cri-dockerd too:

bash-5.1# ./cri-dockerd --version
cri-dockerd 0.2.4 (4b57f30)

gmanera · 2022-11-07T12:52:12Z

@jiaqiluo ,
Thanks for your response. You helped us work out how to configure the kubelet config file.
However, even with the following configuration, the issue persists.

kubelet:
  extra_args:
    config: /var/lib/kubelet/kubelet-config.yml
  extra_binds:
    - >-
      /var/lib/kubelet/kubelet-config.yml:/var/lib/kubelet/kubelet-config.yml

Our cri-dockerd version: cri-dockerd 0.2.4
cri-dockerd comes from rancher/rke-tools; we're using the latest available version (v0.1.87).

What are our options from here?

Thanks in advance.

iTaybb · 2022-11-08T16:05:31Z

I also have the same issue in RKE 1.24.

gmanera · 2022-11-10T17:28:21Z

@jiaqiluo ,
Is it possible to update only the cri-dockerd version?
We are using https://github.com/rancher/rke-tools/releases/tag/v0.1.87, which ships cri-dockerd 0.2.4.
Is there any plan to release a new rke-tools version with cri-dockerd updated?

We're using the kubelet config file (we can see it through docker inspect kubelet), but exactly the same problem persists.

We have a limited internet connection (from Brazil), and images like Airflow, Redis, and RabbitMQ exceed the default timeout of 2 minutes.

@vinibodruch or I can send you any logs or information you need.

Thanks in advance.

vinibodruch · 2022-11-11T19:47:50Z

Looking for similar issues, this is the only thing I found that could be the solution: #3051
Is it difficult to change this version, @jiaqiluo?
https://github.com/rancher/rke-tools/blob/2c35b5525f4c17b0cc64f9266f760922216ab9fd/package/Dockerfile#L8

horihel · 2022-11-16T09:14:53Z

I'd be happy if anyone knows a decent workaround (that hopefully doesn't involve SSHing into each node and running docker pull). This is seriously disturbing cluster operations, as there are some cluster images that cannot finish pulling within 2m and will fail endlessly.

gmanera · 2022-11-16T12:36:44Z

@horihel ,
Unfortunately we don't know of any other workaround. SSHing into each node and running docker pull is the only way so far.
We need to wait for the Rancher community.
@jiaqiluo or @superseb, we would really appreciate it if you could guide us.

Thanks in advance.

likku123 · 2022-11-16T12:47:53Z

The hack I am using right now: with an Ansible script, I pre-pull the required images on the nodes, and I have scheduled a cron job to pull the latest changes regularly.
Note: the developers provided the list of required images (26 images) they will use for their runs, which makes this easy to achieve.
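
The workaround above is described as an Ansible script; the plain-shell sketch below only illustrates the same pre-pull idea for a single node. The helper name, the image-list file format, and the paths in the crontab comment are all assumptions.

```shell
# Hypothetical helper: pull every image listed in a file (one per line;
# blank lines and lines starting with # are skipped). All images are tried,
# and the function returns non-zero if any pull failed.
prepull_images() {
    failed=0
    while IFS= read -r image; do
        case "$image" in
            ''|'#'*) continue ;;
        esac
        # Keep docker away from the loop's stdin (the image list).
        docker pull "$image" < /dev/null || failed=1
    done < "$1"
    return "$failed"
}

# Example crontab entry for a script wrapping this function (paths assumed):
#   0 * * * * /usr/local/bin/prepull.sh /etc/prepull-images.txt
```

Running docker pull outside the kubelet sidesteps the 2-minute limit because the pull is no longer bounded by the kubelet's runtime request.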

gmanera · 2022-11-18T21:30:02Z

Hi, @superseb and @jiaqiluo ,
Is it possible to help us with this issue? This is seriously disturbing our cluster operations.

Thanks in advance.

iTaybb · 2022-11-18T22:11:37Z

I've rolled back to 1.23 for the time being.

jiaqiluo · 2022-11-21T22:52:23Z

Hi @likku123 @gmanera @iTaybb @vinibodruch
Sorry for the late reply, I was out sick for the past two weeks and just returned today.
I am glad to see that you guys figured out the root cause and the fix! I can definitely update the cri-dockerd version used in rancher/rke-tools to v0.2.6.
I will fit it into the team's schedule and try to get the fix out ASAP, but sorry that I cannot guarantee a date.
Thank you for your understanding.

kinarashah · 2022-11-21T23:34:55Z

Mirantis/cri-dockerd#105

snasovich · 2022-11-21T23:38:50Z

/backport v1.3.17

vivek-shilimkar · 2022-11-24T05:47:26Z

The issue was reproducible on RKE v1.4.1-rc1.
The cluster throws ErrImagePull and ImagePullBackOff errors for an image that needs more than 2 minutes to pull.

(Ignore folder name)

Fixes for the above error were validated with the RKE v1.4.1-rc2.

Validation steps

Created k8s clusters v1.22.16, v1.23.14, v1.24.8 with RKE v1.4.1-rc2.
Made sure the rke-tools version is v0.1.88.
Created a pod with an image that takes more than 2 minutes to pull.
Waited for pod to come to an active state.
Pod comes to an active state after 5 minutes.
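
The last three steps can be sketched as below. This is not the exact procedure used for the validation; validate_slow_pull is a hypothetical helper, the pod name, image, and namespace are placeholders, and the 10-minute wait is an assumed margin well above the old 2-minute limit.

```shell
# Hypothetical helper: start a pod from a slow-to-pull image and wait for it
# to become Ready.
# $1 = pod name, $2 = image, $3 = namespace
validate_slow_pull() {
    kubectl run "$1" --image="$2" --restart=Never -n "$3" || return 1
    kubectl wait --for=condition=Ready "pod/$1" -n "$3" --timeout=10m
}

# Example (image name is a placeholder):
#   validate_slow_pull slow-pull-test registry.example.com/big-image:latest default
```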

(Ignore folder name)

The issue no longer reproduces with RKE v1.4.1-rc2, hence closing the issue.

jiaqiluo assigned jiaqiluo and snasovich Nov 21, 2022

jiaqiluo mentioned this issue Nov 21, 2022

Update cri-dockerd to v0.2.6 #3051

Closed

jiaqiluo added the [zube]: Team Area 2 label Nov 21, 2022

snasovich assigned kinarashah Nov 21, 2022

kinarashah added the [zube]: Working label Nov 21, 2022

zube bot removed the [zube]: Team Area 2 label Nov 21, 2022

kinarashah added the [zube]: Team Area 2 label Nov 21, 2022

zube bot removed the [zube]: Working label Nov 21, 2022

snasovich added this to the v1.4.1 milestone Nov 21, 2022

rancherbot mentioned this issue Nov 21, 2022

[Backport v1.3] Kubelet timeout generating ImagePullBackOff error #3101

Closed

kinarashah mentioned this issue Nov 21, 2022

update cri-dockerd to v0.2.6 rancher/rke-tools#158

Merged

snasovich mentioned this issue Nov 22, 2022

Kubelet timeout generating ImagePullBackOff error / bump cri-dockerd in RKE1 rancher/rancher#39668

Closed

kinarashah mentioned this issue Nov 22, 2022

[v2.7] update rke-tools to v0.1.88 for v1.22.16, v1.23.14, v1.24.8 rancher/kontainer-driver-metadata#1022

Merged

vivek-shilimkar closed this as completed Nov 24, 2022

zube bot added [zube]: Done and removed [zube]: Team Area 2 labels Nov 24, 2022

vivek-shilimkar self-assigned this Nov 24, 2022

zube bot removed the [zube]: Done label Feb 22, 2023
