This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

The service cannot be started after installation. #5576

Open
siaimes opened this issue Jul 27, 2021 · 1 comment

siaimes commented Jul 27, 2021

Maybe this is due to the docker-cache service not starting after running /bin/bash quick-start-service.sh.

However, the job cannot be started. Below is the kubelet log: it keeps reporting that the image pull is complete, while the OpenPAI system itself does not output any logs.

csip@csip-090:~$ sudo systemctl status kubelet.service 
● kubelet.service - Kubernetes Kubelet Server
   Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: enabled)
   Active: active (running) since Tue 2021-04-27 13:19:30 CST; 4min 39s ago
     Docs: https://github.com/GoogleCloudPlatform/kubernetes
 Main PID: 173276 (kubelet)
    Tasks: 0
   Memory: 32.5M
      CPU: 1.107s
   CGroup: /system.slice/kubelet.service
           ‣ 173276 /usr/local/bin/kubelet --logtostderr=true --v=2 --node-ip=172.17.175.90 --hostname-override=csip-090 --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --config=/etc/kubernetes/kubelet-config.yaml --kubeconfig=/etc/kubernetes/kubelet.conf 

Apr 27 13:23:31 csip-090 kubelet[173276]: I0427 13:23:31.097675  173276 kube_docker_client.go:342] Pulling image "openpai/standard:python_3.6-pytorch_1.2.0-gpu": "7b872974e97c: Pull complete "
Apr 27 13:23:31 csip-090 kubelet[173276]: I0427 13:23:31.776293  173276 setters.go:73] Using node IP: "172.17.175.90"
Apr 27 13:23:41 csip-090 kubelet[173276]: I0427 13:23:41.097618  173276 kube_docker_client.go:342] Pulling image "openpai/standard:python_3.6-pytorch_1.2.0-gpu": "7b872974e97c: Pull complete "
Apr 27 13:23:41 csip-090 kubelet[173276]: I0427 13:23:41.788918  173276 setters.go:73] Using node IP: "172.17.175.90"
Apr 27 13:23:51 csip-090 kubelet[173276]: I0427 13:23:51.097632  173276 kube_docker_client.go:342] Pulling image "openpai/standard:python_3.6-pytorch_1.2.0-gpu": "7b872974e97c: Pull complete "
Apr 27 13:23:51 csip-090 kubelet[173276]: I0427 13:23:51.382029  173276 kubelet_getters.go:177] status for pod nginx-proxy-csip-090 updated to {Running [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2021-04-27 11:48:40 +0800 CST  } {Ready True 0001-01-01 00:00:00 +000
Apr 27 13:23:51 csip-090 kubelet[173276]: I0427 13:23:51.662060  173276 endpoint.go:111] State pushed for device plugin github.com/fuse
Apr 27 13:23:51 csip-090 kubelet[173276]: I0427 13:23:51.800339  173276 setters.go:73] Using node IP: "172.17.175.90"
Apr 27 13:24:01 csip-090 kubelet[173276]: I0427 13:24:01.097620  173276 kube_docker_client.go:342] Pulling image "openpai/standard:python_3.6-pytorch_1.2.0-gpu": "7b872974e97c: Pull complete "
Apr 27 13:24:01 csip-090 kubelet[173276]: I0427 13:24:01.813667  173276 setters.go:73] Using node IP: "172.17.175.90"

Originally posted by @siaimes in #5445 (comment)

siaimes commented Jul 27, 2021

The reason I could manually start the docker-cache service in #5445 (comment) is that I was upgrading from v1.3.0 to v1.6.0, so the images for OpenPAI v1.6.0 were already on my node.

However, on a fresh installation, running /bin/bash quick-start-service.sh will fail, because all three of the following hold at the same time:

  1. No node has the images for the corresponding version of OpenPAI;
  2. Every node's /etc/docker/daemon.json has been modified to point image pulls at the master node;
  3. The docker-cache service on the master node has not been started.

These three conditions conflict: every image pull is redirected to the master node, where nothing is serving images yet.

Possible solution:
Pull the docker-cache image onto the master node before restarting Docker. Then, when running /bin/bash quick-start-service.sh, start the docker-cache service first.
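
To make the ordering problem concrete, here is a minimal sketch that models the installation steps as a dependency graph. The step names are illustrative labels I made up for this comment, not OpenPAI commands, and the dependencies are my reading of the three conditions above, not something taken from the installer code. It uses graphlib from the Python standard library (Python 3.9+).

```python
from graphlib import CycleError, TopologicalSorter

# Each key is a step; its value is the set of steps that must finish first.
# All step names are hypothetical labels, not real OpenPAI commands.

# Fresh install as described above: pulls are already redirected to the
# master, but the service that answers them needs its own image first.
fresh_install = {
    "pull docker-cache image": {"docker-cache service running"},
    "docker-cache service running": {"pull docker-cache image"},
}

# Proposed order: pull the image before Docker is restarted with the
# modified daemon.json, then start docker-cache before anything else.
proposed_fix = {
    "pull docker-cache image": set(),
    "restart docker": {"pull docker-cache image"},
    "docker-cache service running": {"restart docker"},
    "start remaining services": {"docker-cache service running"},
}


def plan(steps):
    """Return a valid execution order, or None if the steps form a cycle."""
    try:
        return list(TopologicalSorter(steps).static_order())
    except CycleError:
        return None


print(plan(fresh_install))  # no valid order exists
print(plan(proposed_fix))
```

With the fresh-install graph no order can be found, which matches the failure described above; with the proposed order the steps form a simple chain.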
