Kubelet timeout generating ImagePullBackOff error #3084

Closed
vinibodruch opened this issue Nov 3, 2022 · 21 comments

@vinibodruch

TL;DR

Where is the kubelet config file on Rancher 2.6.9 (RKE1), like the one described at https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/?
Can I manage it? Does this file exist?
I didn't find it in /var/lib/kubelet:

# pwd
/var/lib/kubelet
# ls -lha
total 16K
drwxr-xr-x   9 root root  185 Sep  5 13:20 .
drwxr-xr-x. 42 root root 4.0K Sep 22 15:50 ..
-rw-------   1 root root   62 Sep  5 13:20 cpu_manager_state
drwxr-xr-x   2 root root   45 Nov  1 11:27 device-plugins
-rw-------   1 root root   61 Sep  5 13:20 memory_manager_state
drwxr-xr-x   2 root root   44 Sep  5 13:20 pki
drwxr-x---   2 root root    6 Sep  5 13:20 plugins
drwxr-x---   2 root root    6 Sep  5 13:20 plugins_registry
drwxr-x---   2 root root   26 Nov  1 11:27 pod-resources
drwxr-x---  11 root root 4.0K Oct 24 23:57 pods
drwxr-xr-x   2 root root    6 Sep  5 13:20 volumeplugins

Explain

Recently we upgraded Kubernetes to v1.24.4-rancher1-1 and Rancher to 2.6.9. Everything worked fine at first, but we have since noticed a new behavior: if an image is too big, or its download takes more than 2 minutes, Kubernetes raises an ErrImagePull.
To work around the error, I need to log in to the cluster and run docker pull <image> manually.

Error: ImagePullBackOff

~ ❯ kubectl get pods -n mobile test-imagepullback-c7fc59d86-gwtc7
NAME                                 READY   STATUS              RESTARTS   AGE
test-imagepullback-c7fc59d86-gwtc7   0/1     ContainerCreating   0          2m

~ ❯ kubectl get pods -n mobile test-imagepullback-c7fc59d86-gwtc7
NAME                                 READY   STATUS         RESTARTS   AGE
test-imagepullback-c7fc59d86-gwtc7   0/1     ErrImagePull   0          2m1s

~ ❯ kubectl get pods -n mobile test-imagepullback-c7fc59d86-gwtc7
NAME                                 READY   STATUS             RESTARTS   AGE
test-imagepullback-c7fc59d86-gwtc7   0/1     ImagePullBackOff   0          2m12s

Searching for the cause, we found that the error comes from a timeout in the kubelet's runtime requests (2 minutes by default, according to the docs at https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/), which can be raised with the flag --runtime-request-timeout duration. However, changing cluster.yml with the parameters below had no effect:

[...]
    kubelet:
      extra_args:
        runtime-request-timeout: 10m
      fail_swap_on: false
[...]

The running process shows that the parameter is passed to the kubelet:

# ps -ef | grep runtime-request-timeout
root      7286  7267  0 Nov01 ?        00:00:00 /bin/bash /opt/rke-tools/entrypoint.sh kubelet {...} --runtime-request-timeout=10m {...}
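
To check a single flag without reading the whole ps line by eye, a small helper along these lines might be handy. This is only a sketch: flag_value is a hypothetical name, not part of RKE or kubelet, and it assumes flags are written in the --name=value form seen above.

```shell
# Hypothetical helper (not part of RKE or kubelet).
# Prints the value of a --name=value flag found in a command line string.
# Prints nothing when the flag is absent.
flag_value() {
  flag=$1
  # Put each argument on its own line, then print the value of the first match.
  printf '%s\n' "$2" | tr ' ' '\n' | sed -n "s/^--${flag}=//p" | head -n 1
}

# Example with a shortened kubelet command line:
flag_value runtime-request-timeout "kubelet --fail-swap-on=false --runtime-request-timeout=10m --v=2"
# prints 10m
```

In practice the second argument could be the matching line from ps -ef.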

On the official page this flag is marked as deprecated, which would explain this behavior. To change the value I apparently need to set a parameter named runtimeRequestTimeout inside a config file.
So I have some questions:

  • Where do I change it?
  • Does this file exist in Rancher, or do I need to create it?
  • Is there a way to work around this with another parameter in extra_args?
  • Why is this happening now? Is it because of the deprecation of dockershim?

I read these docs too, but without success:

Configs and current versions

K8s version:

#  kubectl version --short
Client Version: v1.25.0
Kustomize Version: v4.5.7
Server Version: v1.24.4

RKE version:

# rke --version
rke version v1.3.15

Docker version: (docker version,docker info preferred)

# docker --version
Docker version 20.10.7, build f0df350

# docker info
Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Build with BuildKit (Docker Inc., v0.5.1-docker)
  scan: Docker Scan (Docker Inc., v0.12.0)

Server:
 Containers: 20
  Running: 20
  Paused: 0
  Stopped: 0
 Images: 8
 Server Version: 20.10.21
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 1c90a442489720eec95342e1789ee8a5e1b9536f
 runc version: v1.1.4-0-g5fd4c4d
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 3.10.0-1160.76.1.el7.x86_64
 Operating System: CentOS Linux 7 (Core)
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 11.58GiB
 Name: anchieta
 ID: OZNJ:RKES:NTOH:G37X:4NYO:IJ3U:SKHO:FFG3:RJ7B:GCCJ:XOZN:NRHE
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Operating system and kernel: (cat /etc/os-release, uname -r preferred)

# cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

# uname -r
3.10.0-1160.76.1.el7.x86_64

Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO)
VMware

cluster.yml file:

nodes:
- address: host1
  port: "22"
  internal_address: ""
  role:
  - controlplane
  - etcd
  hostname_override: ""
  user: rancher
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: ~/.ssh/id_rsa
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
  taints: []

[...]

services:
  etcd:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    win_extra_args: {}
    win_extra_binds: []
    win_extra_env: []
    external_urls: []
    ca_cert: ""
    cert: ""
    key: ""
    path: ""
    uid: 0
    gid: 0
    snapshot: null
    retention: ""
    creation: ""
    backup_config: null
  kube-api:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    win_extra_args: {}
    win_extra_binds: []
    win_extra_env: []
    service_cluster_ip_range: 10.43.0.0/16
    service_node_port_range: ""
    pod_security_policy: false
    always_pull_images: false
    secrets_encryption_config: null
    audit_log: null
    admission_configuration: null
    event_rate_limit: null
  kube-controller:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    win_extra_args: {}
    win_extra_binds: []
    win_extra_env: []
    cluster_cidr: 10.42.0.0/16
    service_cluster_ip_range: 10.43.0.0/16
  scheduler:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    win_extra_args: {}
    win_extra_binds: []
    win_extra_env: []
  kubelet:
    image: ""
    extra_args: 
      runtime-request-timeout: 30m
    extra_binds: []
    extra_env: []
    win_extra_args: {}
    win_extra_binds: []
    win_extra_env: []
    cluster_domain: cluster.local
    infra_container_image: ""
    cluster_dns_server: 10.43.0.10
    fail_swap_on: false
    generate_serving_certificate: false
  kubeproxy:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    win_extra_args: {}
    win_extra_binds: []
    win_extra_env: []
network:
  plugin: canal
  options: {}
  mtu: 0
  node_selector: {}
  update_strategy: null
  tolerations: []
authentication:
  strategy: x509
  sans: []
  webhook: null
addons: ""
addons_include: []
# Versions defined in kubernetes_version
#system_images:
#  etcd: rancher/mirrored-coreos-etcd:v3.5.0
#  alpine: rancher/rke-tools:v0.1.78
#  nginx_proxy: rancher/rke-tools:v0.1.78
#  cert_downloader: rancher/rke-tools:v0.1.78
#  kubernetes_services_sidecar: rancher/rke-tools:v0.1.78
#  kubedns: rancher/mirrored-k8s-dns-kube-dns:1.17.4
#  dnsmasq: rancher/mirrored-k8s-dns-dnsmasq-nanny:1.17.4
#  kubedns_sidecar: rancher/mirrored-k8s-dns-sidecar:1.17.4
#  kubedns_autoscaler: rancher/mirrored-cluster-proportional-autoscaler:1.8.3
#  coredns: rancher/mirrored-coredns-coredns:1.8.6
#  coredns_autoscaler: rancher/mirrored-cluster-proportional-autoscaler:1.8.5
#  nodelocal: rancher/mirrored-k8s-dns-node-cache:1.21.1
#  #kubernetes: rancher/hyperkube:v1.24.4-rancher1
#  flannel: rancher/mirrored-coreos-flannel:v0.15.1
#  flannel_cni: rancher/flannel-cni:v0.3.0-rancher6
#  calico_node: rancher/mirrored-calico-node:v3.21.1
#  calico_cni: rancher/mirrored-calico-cni:v3.21.1
#  calico_controllers: rancher/mirrored-calico-kube-controllers:v3.21.1
#  calico_ctl: rancher/mirrored-calico-ctl:v3.21.1
#  calico_flexvol: rancher/mirrored-calico-pod2daemon-flexvol:v3.21.1
#  canal_node: rancher/mirrored-calico-node:v3.21.1
#  canal_cni: rancher/mirrored-calico-cni:v3.21.1
#  canal_controllers: rancher/mirrored-calico-kube-controllers:v3.21.1
#  canal_flannel: rancher/mirrored-coreos-flannel:v0.15.1
#  canal_flexvol: rancher/mirrored-calico-pod2daemon-flexvol:v3.21.1
#  weave_node: weaveworks/weave-kube:2.8.1
#  weave_cni: weaveworks/weave-npc:2.8.1
#  pod_infra_container: rancher/mirrored-pause:3.5
#  ingress: rancher/nginx-ingress-controller:nginx-1.1.0-rancher1
#  ingress_backend: rancher/mirrored-nginx-ingress-controller-defaultbackend:1.5-rancher1
#  ingress_webhook: rancher/mirrored-ingress-nginx-kube-webhook-certgen:v1.1.1
#  metrics_server: rancher/mirrored-metrics-server:v0.5.1
#  windows_pod_infra_container: rancher/kubelet-pause:v0.1.6
#  aci_cni_deploy_container: noiro/cnideploy:5.1.1.0.1ae238a
#  aci_host_container: noiro/aci-containers-host:5.1.1.0.1ae238a
#  aci_opflex_container: noiro/opflex:5.1.1.0.1ae238a
#  aci_mcast_container: noiro/opflex:5.1.1.0.1ae238a
#  aci_ovs_container: noiro/openvswitch:5.1.1.0.1ae238a
#  aci_controller_container: noiro/aci-containers-controller:5.1.1.0.1ae238a
#  aci_gbp_server_container: noiro/gbp-server:5.1.1.0.1ae238a
#  aci_opflex_server_container: noiro/opflex-server:5.1.1.0.1ae238a
ssh_key_path: ~/.ssh/id_rsa
ssh_cert_path: ""
ssh_agent_auth: false
authorization:
  mode: rbac
  options: {}
ignore_docker_version: null
enable_cri_dockerd: null
kubernetes_version: "v1.24.4-rancher1-1"
private_registries: []
ingress:
  provider: ""
  options: {}
  node_selector: {}
  extra_args: {}
  dns_policy: ""
  extra_envs: []
  extra_volumes: []
  extra_volume_mounts: []
  update_strategy: null
  http_port: 0
  https_port: 0
  network_mode: ""
  tolerations: []
  default_backend: null
  default_http_backend_priority_class_name: ""
  nginx_ingress_controller_priority_class_name: ""
  default_ingress_class: null
cluster_name: ""
cloud_provider:
  name: ""
prefix_path: ""
win_prefix_path: ""
addon_job_timeout: 0
bastion_host:
  address: ""
  port: ""
  user: ""
  ssh_key: ""
  ssh_key_path: ""
  ssh_cert: ""
  ssh_cert_path: ""
  ignore_proxy_env_vars: false
monitoring:
  provider: ""
  options: {}
  node_selector: {}
  update_strategy: null
  replicas: null
  tolerations: []
  metrics_server_priority_class_name: ""
restore:
  restore: false
  snapshot_name: ""
rotate_encryption_key: false
dns: null

I would be grateful for any help, so that I and others can solve this annoying issue.

@gmanera

gmanera commented Nov 3, 2022

same problem here.

@jiaqiluo
Member

jiaqiluo commented Nov 4, 2022

Where is the kubelet config file on Rancher 2.6.9 - RKE1, like this kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file
Can I manage it? Does this file exist?

Rancher/RKE does not use a kubelet config file to configure the kubelet, which sadly means you cannot find one anywhere. But that does not mean you cannot use one, and you are actually very close to the final solution:

You need to set both extra_args and extra_binds to make it work.
In your cluster.yml it will look like the following:

services:
  kubelet:
    extra_args:
      config: path-to-the-config-file-in-the-container
    extra_binds:
      - "path-to-file-on-host:path-to-the-config-file-in-the-container"

And of course, you need to create the config file on the control plane node beforehand.
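
As a rough illustration of creating such a file on the host, the sketch below writes a minimal KubeletConfiguration. The function name write_kubelet_config and the path in the usage note are hypothetical; only the apiVersion, kind, and runtimeRequestTimeout fields come from the kubelet configuration format. It shows how the file could be produced, not that it changes the pull timeout on a given setup.

```shell
# Hypothetical helper: write a minimal kubelet config file.
# $1 = destination path on the host, $2 = timeout as a duration string (for example 10m).
write_kubelet_config() {
  path=$1
  timeout=$2
  cat > "$path" <<EOF
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
runtimeRequestTimeout: "${timeout}"
EOF
}
```

For example, write_kubelet_config /etc/kubernetes/kubelet-config.yaml 10m would create the file at that (assumed) path, which would then be the host side of the extra_binds entry.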

I hope this is helpful.

@jiaqiluo
Member

jiaqiluo commented Nov 4, 2022

and a caveat: AFAIK, the kubelet process does not auto-restart when the changes are made in the config file, which means you need to restart the kubelet container after changes are made to the "external" config file.

@likku123

likku123 commented Nov 4, 2022

I am still experiencing the timeout issue.

kubelet:
  extra_args:
    config: /opt/kubelet_timeout_config.yaml
  extra_binds:
    - '/opt/kubelet_timeout_config.yaml:/opt/kubelet_timeout_config.yaml'

kubelet_timeout_config.yaml

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
serializeImagePulls: false
runtimeRequestTimeout: "240m"

@jiaqiluo
Member

jiaqiluo commented Nov 4, 2022

@likku123 can you do the following checks on the kubelet container on the node:

  • docker logs to check if there is any error message.
  • docker exec into the container to see if the config file exists and contains the proper content.
  • docker inspect to check if --config is set.

If all of the above look right, it means RKE has configured the kubelet properly, and I would then suspect an upstream issue or something wrong outside of RKE.

@vinibodruch
Author

Thanks for your response, Jiaqi Luo. I spent this afternoon trying to solve this issue, but unfortunately I couldn't find a way.

The kubelet-config.yml file I used on each server:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
clientCAFile: "/etc/kubernetes/ssl/kube-ca.pem"
runtimeRequestTimeout: 45m0s
tlsCipherSuites: ["TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256", "TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384", "TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305", "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256", "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384", "TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305"]
failSwapOn: False
volumePluginDir: "/var/lib/kubelet/volumeplugin"
clusterDomain: "cluster.local"

The RKE config:

    kubelet:
      extra_args:
        config: /var/lib/kubelet/kubelet-config.yml
      extra_binds:
        - >-
          /var/lib/kubelet/kubelet-config.yml:/var/lib/kubelet/kubelet-config.yml

And finally, the process running on one server with --config, as an example:

# ps -ef | grep kubelet
root     22181 22161  0 16:57 ?        00:00:00 /bin/bash /opt/rke-tools/entrypoint.sh kubelet --cgroups-per-qos=True --make-iptables-util-chains=true --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305 --cloud-provider= --fail-swap-on=false --kubeconfig=/etc/kubernetes/ssl/kubecfg-kube-node.yaml --container-runtime=remote --event-qps=0 --address=0.0.0.0 --config=/var/lib/kubelet/kubelet-config.yml --client-ca-file=/etc/kubernetes/ssl/kube-ca.pem --cluster-dns=10.43.0.10 --cluster-domain=cluster.local --root-dir=/var/lib/kubelet --authentication-token-webhook=true --hostname-override=saquarema --container-runtime-endpoint=unix:///var/run/cri-dockerd.sock --anonymous-auth=false --v=2 --authorization-mode=Webhook --pod-infra-container-image=registry.hub.docker.com/rancher/mirrored-pause:3.6 --read-only-port=0 --resolv-conf=/etc/resolv.conf --streaming-connection-idle-timeout=30m --volume-plugin-dir=/var/lib/kubelet/volumeplugins
root     23588 22181  3 16:58 ?        00:00:17 kubelet --cgroups-per-qos=True --make-iptables-util-chains=true --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305 --cloud-provider= --fail-swap-on=false --kubeconfig=/etc/kubernetes/ssl/kubecfg-kube-node.yaml --container-runtime=remote --event-qps=0 --address=0.0.0.0 --config=/var/lib/kubelet/kubelet-config.yml --client-ca-file=/etc/kubernetes/ssl/kube-ca.pem --cluster-dns=10.43.0.10 --cluster-domain=cluster.local --root-dir=/var/lib/kubelet --authentication-token-webhook=true --hostname-override=saquarema --container-runtime-endpoint=unix:///var/run/cri-dockerd.sock --anonymous-auth=false --v=2 --authorization-mode=Webhook --pod-infra-container-image=registry.hub.docker.com/rancher/mirrored-pause:3.6 --read-only-port=0 --resolv-conf=/etc/resolv.conf --streaming-connection-idle-timeout=30m --volume-plugin-dir=/var/lib/kubelet/volumeplugins --cgroup-driver=cgroupfs
root     27494 27474  0 Nov03 ?        00:00:01 /csi-node-driver-registrar --v=2 --csi-address=/csi/csi.sock --kubelet-registration-path=/var/lib/kubelet/plugins/driver.longhorn.io/csi.sock

I found a request that appears to show the kubelet's current configuration:

~ ❯ kubectl proxy --port=8001
~ ❯ NODE_NAME="host1"; curl -sSL "http://localhost:8001/api/v1/nodes/${NODE_NAME}/proxy/configz" | jq '.kubeletconfig|.kind="KubeletConfiguration"|.apiVersion="kubelet.config.k8s.io/v1beta1"' > kubelet_config_${NODE_NAME}

The content returned:

{
    "kubeletconfig": {
        "enableServer": true,
        "syncFrequency": "1m0s",
        "fileCheckFrequency": "20s",
        "httpCheckFrequency": "20s",
        "address": "0.0.0.0",
        "port": 10250,
        "tlsCertFile": "/var/lib/kubelet/pki/kubelet.crt",
        "tlsPrivateKeyFile": "/var/lib/kubelet/pki/kubelet.key",
        "tlsCipherSuites": [
            "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256",
            "TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384",
            "TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305",
            "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256",
            "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384",
            "TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305"
        ],
        "authentication": {
            "x509": {
                "clientCAFile": "/etc/kubernetes/ssl/kube-ca.pem"
            },
            "webhook": {
                "enabled": true,
                "cacheTTL": "2m0s"
            },
            "anonymous": {
                "enabled": false
            }
        },
        "authorization": {
            "mode": "Webhook",
            "webhook": {
                "cacheAuthorizedTTL": "5m0s",
                "cacheUnauthorizedTTL": "30s"
            }
        },
        "registryPullQPS": 5,
        "registryBurst": 10,
        "eventRecordQPS": 0,
        "eventBurst": 10,
        "enableDebuggingHandlers": true,
        "healthzPort": 10248,
        "healthzBindAddress": "127.0.0.1",
        "oomScoreAdj": -999,
        "clusterDomain": "cluster.local",
        "clusterDNS": [
            "10.43.0.10"
        ],
        "streamingConnectionIdleTimeout": "30m0s",
        "nodeStatusUpdateFrequency": "10s",
        "nodeStatusReportFrequency": "5m0s",
        "nodeLeaseDurationSeconds": 40,
        "imageMinimumGCAge": "2m0s",
        "imageGCHighThresholdPercent": 85,
        "imageGCLowThresholdPercent": 80,
        "volumeStatsAggPeriod": "1m0s",
        "cgroupsPerQOS": true,
        "cgroupDriver": "cgroupfs",
        "cpuManagerPolicy": "none",
        "cpuManagerReconcilePeriod": "10s",
        "memoryManagerPolicy": "None",
        "topologyManagerPolicy": "none",
        "topologyManagerScope": "container",
        "runtimeRequestTimeout": "40m0s",
        "hairpinMode": "promiscuous-bridge",
        "maxPods": 110,
        "podPidsLimit": -1,
        "resolvConf": "/etc/resolv.conf",
        "cpuCFSQuota": true,
        "cpuCFSQuotaPeriod": "100ms",
        "nodeStatusMaxImages": 50,
        "maxOpenFiles": 1000000,
        "contentType": "application/vnd.kubernetes.protobuf",
        "kubeAPIQPS": 5,
        "kubeAPIBurst": 10,
        "serializeImagePulls": true,
        "evictionHard": {
            "imagefs.available": "15%",
            "memory.available": "100Mi",
            "nodefs.available": "10%",
            "nodefs.inodesFree": "5%"
        },
        "evictionPressureTransitionPeriod": "5m0s",
        "enableControllerAttachDetach": true,
        "makeIPTablesUtilChains": true,
        "iptablesMasqueradeBit": 14,
        "iptablesDropBit": 15,
        "failSwapOn": false,
        "memorySwap": {},
        "containerLogMaxSize": "10Mi",
        "containerLogMaxFiles": 5,
        "configMapAndSecretChangeDetectionStrategy": "Watch",
        "enforceNodeAllocatable": [
            "pods"
        ],
        "volumePluginDir": "/var/lib/kubelet/volumeplugins",
        "logging": {
            "format": "text",
            "flushFrequency": 5000000000,
            "verbosity": 1,
            "options": {
                "json": {
                    "infoBufferSize": "0"
                }
            }
        },
        "enableSystemLogHandler": true,
        "shutdownGracePeriod": "0s",
        "shutdownGracePeriodCriticalPods": "0s",
        "enableProfilingHandler": true,
        "enableDebugFlagsHandler": true,
        "seccompDefault": false,
        "memoryThrottlingFactor": 0.8,
        "registerWithTaints": [
            {
                "key": "node-role.kubernetes.io/controlplane",
                "value": "true",
                "effect": "NoSchedule"
            }
        ],
        "registerNode": true
    }
}
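
To pull one value out of a dump like this on a host where jq is not available, a crude text match may be enough. The sketch below is an assumption-laden shortcut, not a JSON parser: config_field is a hypothetical name, it only handles string values, and it returns the first match on a line.

```shell
# Hypothetical helper: print the first string value of a field from JSON on stdin.
# This is a plain text match; it does not understand nesting or non-string values.
config_field() {
  sed -n "s/.*\"$1\": *\"\([^\"]*\)\".*/\1/p" | head -n 1
}
```

For example, config_field runtimeRequestTimeout < kubelet_config_host1 would read the file saved by the command above. Where jq is installed, it remains the more reliable choice.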

I restarted the kubelet and the server, but the ErrImagePull behavior still persists whenever a pull takes more than 2 minutes...
docker container inspect kubelet does show the --config flag, and docker container logs kubelet is not very helpful, showing only ErrImagePull: rpc error: code = Unknown desc = context deadline exceeded

So, searching further, I found similar issues:

And a pull request for this issue too: kubernetes/minikube#13600

So this is probably a bug! But I found something interesting that I'll try later, related to changing the container runtime: kubernetes/minikube#14789 (comment)

@likku123

likku123 commented Nov 5, 2022

This is definitely an issue with the cri-dockerd version that ships with rke-tools.
Right now the version is:
/opt/rke-tools/bin# ./cri-dockerd --version
cri-dockerd 0.2.4 (4b57f30)
According to this link, kubernetes/minikube#14789 (comment), cri-dockerd 0.2.6 is the release that fixes the timeout issue.

Any suggestions on how to deploy cri-dockerd 0.2.6 in my present setup?
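
To check from a script whether a node already has a new enough cri-dockerd, a version comparison along these lines could be used. It is a sketch only: version_ge is a hypothetical helper, it relies on sort -V (available in GNU coreutils), and the 0.2.6 threshold is taken from the linked comment rather than verified here.

```shell
# Hypothetical helper: succeed when version $1 is greater than or equal to $2.
# Assumes a sort that supports -V (version sort), as in GNU coreutils.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example: the version string is the second word of the --version output shown above.
installed=$(printf '%s\n' "cri-dockerd 0.2.4 (4b57f30)" | awk '{print $2}')
if version_ge "$installed" 0.2.6; then
  echo "cri-dockerd $installed includes the reported fix"
else
  echo "cri-dockerd $installed is older than 0.2.6"
fi
```

On a node, the printf could be replaced with the real ./cri-dockerd --version call from /opt/rke-tools/bin.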

@vinibodruch
Author

No idea how to upgrade it...
I have the same version of cri-dockerd too:

bash-5.1# ./cri-dockerd --version
cri-dockerd 0.2.4 (4b57f30)

@gmanera

gmanera commented Nov 7, 2022

@jiaqiluo ,
Thanks for your response. You helped us configure the kubelet config file.
However, even with the following configuration, the issue persists.

kubelet:
  extra_args:
    config: /var/lib/kubelet/kubelet-config.yml
  extra_binds:
    - >-
      /var/lib/kubelet/kubelet-config.yml:/var/lib/kubelet/kubelet-config.yml

Our cri-dockerd version: cri-dockerd 0.2.4
cri-dockerd comes from rancher/rke-tools, and we're using the latest version available (v0.1.87).

What are our options from here?

Thanks in advance.

@iTaybb

iTaybb commented Nov 8, 2022

I also have the same issue in RKE 1.24.

@gmanera

gmanera commented Nov 10, 2022

@jiaqiluo ,
Is it possible to update only the cri-dockerd version?
We are using https://github.com/rancher/rke-tools/releases/tag/v0.1.87, which includes cri-dockerd 0.2.4.
Is there any plan to release a new rke-tools version with cri-dockerd updated?

We're using the kubelet config file (we can see it through docker inspect kubelet), but the exact same problem persists.

We have a limited internet connection (from Brazil), and images like Airflow, Redis, and RabbitMQ exceed the default timeout of 2 minutes.

@vinibodruch or I can send you any logs or information you need.

Thanks in advance.

@vinibodruch
Author

Looking for similar issues, this is the only thing I found that could be the solution: #3051
Is it difficult to change this version, @jiaqiluo?
https://github.com/rancher/rke-tools/blob/2c35b5525f4c17b0cc64f9266f760922216ab9fd/package/Dockerfile#L8

@horihel

horihel commented Nov 16, 2022

I'd be happy if anyone knows a decent workaround (that hopefully doesn't involve SSHing into each node and running docker pull). This is seriously disrupting cluster operations, as there are some cluster images that cannot finish pulling within 2m and will fail endlessly.

@gmanera

gmanera commented Nov 16, 2022

@horihel ,
Unfortunately we don't know of any other workaround. SSHing into each node and running docker pull is the only way so far.
We need to wait for the Rancher community.
@jiaqiluo or @superseb, we would really appreciate it if you could guide us.

Thanks in advance.

@likku123

The hack I am using right now: an Ansible script manually downloads the required images on the nodes, and a scheduled cron job pulls the latest changes regularly.
Note: the developers have provided the list of images (26 images) they will use for their runs, which makes this easy to achieve.
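
The per-node part of a workaround like this could look roughly like the sketch below, which pulls every image named in a list file. It is not the Ansible script mentioned above: prepull_images, the PULL_CMD variable, and the file layout (one image reference per line) are assumptions made for illustration.

```shell
# Hypothetical helper: pull every image listed in a file, one reference per line.
# Blank lines and lines starting with # are skipped.
# PULL_CMD defaults to "docker pull"; set it to "echo" for a dry run.
prepull_images() {
  while IFS= read -r image; do
    case "$image" in
      ''|'#'*) continue ;;
    esac
    # stdin is redirected so the pull command cannot consume the list file.
    ${PULL_CMD:-docker pull} "$image" </dev/null || echo "pull failed: $image" >&2
  done < "$1"
}
```

A cron entry on each node could then call a script containing prepull_images with the path of the image list, so that large images are already present before a pod is scheduled.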

@gmanera

gmanera commented Nov 18, 2022

Hi, @superseb and @jiaqiluo ,
Is it possible to help us with this issue? This is seriously disrupting our cluster operations.

Thanks in advance.

@iTaybb

iTaybb commented Nov 18, 2022

I've rolled back to 1.23 for the time being.

@jiaqiluo
Member

Hi @likku123 @gmanera @iTaybb @vinibodruch
Sorry for the late reply, I was out sick for the past two weeks and just returned today.
I am glad to see that you guys figured out the root cause and the fix! I can definitely update the cri-dockerd version used in rancher/rke-tools to v0.2.6.
I will fit it into the team's schedule and try to get the fix out ASAP, but sorry that I cannot guarantee a date.
Thank you for your understanding.

@kinarashah
Member

Mirantis/cri-dockerd#105

@snasovich
Collaborator

/backport v1.3.17

@vivek-shilimkar
Member

The issue was reproducible on RKE v1.4.1-rc1.
The cluster throws ErrImagePull and ImagePullBackOff errors for an image that needs more than 2 minutes to pull.

(Ignore folder name)
Screenshot from 2022-11-23 15-59-04

Fixes for the above error were validated with RKE v1.4.1-rc2.

Validation steps

  1. Created k8s clusters v1.22.16, v1.23.14, v1.24.8 with RKE v1.4.1-rc2.
  2. Made sure the rke-tools version is v0.1.88.
  3. Created a pod with an image that takes more than 2 minutes to pull.
  4. Waited for pod to come to an active state.
  5. Pod comes to an active state after 5 minutes.

(Ignore folder name)
Screenshot from 2022-11-23 16-12-05

The issue does not occur with RKE v1.4.1-rc2, hence closing the issue.
