
openshift_node: "restart node" ansible task fails sometimes #15332

Closed
gabemontero opened this issue Jul 19, 2017 · 10 comments

@gabemontero
Contributor

See https://ci.openshift.redhat.com/jenkins/job/test_branch_origin_extended_templates/24/consoleFull

Failure summary:

  1. Host:     localhost
     Play:     Configure nodes
     Task:     openshift_node : restart node
     Message:  Unable to restart service origin-node: Job for origin-node.service failed because the control process exited with error code. See "systemctl status origin-node.service" and "journalctl -xe" for details.

There have been similar incidents in the last few months, but all the issues I found referencing this have been closed, so I'm opening a new one.

@openshift/devex @stevekuznetsov @dobbymoodge FYI

@gabemontero gabemontero added the area/infrastructure, kind/test-flake, and priority/P2 labels on Jul 19, 2017
@stevekuznetsov
Contributor

What's in the node log?

@gabemontero
Contributor Author

Here's the last portion, including an error log entry and the fatal log entry:

Jul 19 09:21:05 ip-172-18-12-128.ec2.internal origin-node[33640]: E0719 09:21:05.713262   33640 kubelet.go:2136] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
Jul 19 09:21:05 ip-172-18-12-128.ec2.internal origin-node[33640]: I0719 09:21:05.759731   33640 eviction_manager.go:221] eviction manager: synchronize housekeeping
Jul 19 09:21:05 ip-172-18-12-128.ec2.internal origin-node[33640]: I0719 09:21:05.801287   33640 helpers.go:799] eviction manager: observations: signal=nodefs.inodesFree, available: 36454329, capacity: 36698048, time: 2017-07-19 09:20:55.55719983 -0400 EDT
Jul 19 09:21:05 ip-172-18-12-128.ec2.internal origin-node[33640]: I0719 09:21:05.801318   33640 helpers.go:799] eviction manager: observations: signal=imagefs.available, available: 10870272Ki, capacity: 20660Mi, time: 2017-07-19 09:20:55.55719983 -0400 EDT
Jul 19 09:21:05 ip-172-18-12-128.ec2.internal origin-node[33640]: I0719 09:21:05.801331   33640 helpers.go:801] eviction manager: observations: signal=allocatableMemory.available, available: 15902404Ki, capacity: 15902404Ki
Jul 19 09:21:05 ip-172-18-12-128.ec2.internal origin-node[33640]: I0719 09:21:05.801338   33640 helpers.go:799] eviction manager: observations: signal=memory.available, available: 13920956Ki, capacity: 16004804Ki, time: 2017-07-19 09:20:55.55719983 -0400 EDT
Jul 19 09:21:05 ip-172-18-12-128.ec2.internal origin-node[33640]: I0719 09:21:05.801349   33640 helpers.go:799] eviction manager: observations: signal=nodefs.available, available: 22762844Ki, capacity: 36687852Ki, time: 2017-07-19 09:20:55.55719983 -0400 EDT
Jul 19 09:21:05 ip-172-18-12-128.ec2.internal origin-node[33640]: W0719 09:21:05.801361   33640 helpers.go:771] eviction manager: no observation found for eviction signal allocatableNodeFs.available
Jul 19 09:21:05 ip-172-18-12-128.ec2.internal origin-node[33640]: I0719 09:21:05.801379   33640 eviction_manager.go:324] eviction manager: no resources are starved
Jul 19 09:21:06 ip-172-18-12-128.ec2.internal origin-node[33640]: I0719 09:21:06.537704   33640 generic.go:182] GenericPLEG: Relisting
Jul 19 09:21:07 ip-172-18-12-128.ec2.internal origin-node[33640]: I0719 09:21:07.496342   33640 kubelet.go:1954] SyncLoop (housekeeping)
Jul 19 09:21:07 ip-172-18-12-128.ec2.internal origin-node[33640]: I0719 09:21:07.540555   33640 generic.go:182] GenericPLEG: Relisting
Jul 19 09:21:08 ip-172-18-12-128.ec2.internal origin-node[33640]: W0719 09:21:08.130774   33640 sdn_controller.go:38] Could not find an allocated subnet for node: ip-172-18-12-128.ec2.internal, Waiting...
Jul 19 09:21:08 ip-172-18-12-128.ec2.internal origin-node[33640]: F0719 09:21:08.130794   33640 node.go:310] error: SDN node startup failed: failed to get subnet for this host: ip-172-18-12-128.ec2.internal, error: timed out waiting for the condition
Jul 19 09:21:08 ip-172-18-12-128.ec2.internal systemd[1]: origin-node.service: main process exited, code=exited, status=255/n/a
Jul 19 09:21:08 ip-172-18-12-128.ec2.internal systemd[1]: Failed to start OpenShift Node.
Jul 19 09:21:08 ip-172-18-12-128.ec2.internal systemd[1]: Unit origin-node.service entered failed state.
Jul 19 09:21:08 ip-172-18-12-128.ec2.internal systemd[1]: origin-node.service failed.

@apooniajjn

I am seeing the same error on my setup when using AWS as the cloud provider. The node registers itself with the master, but the node service won't start. I would like to know what needs to be done.

[root@osmaster01 centos]# kubectl describe node ip-10-30-2-251.us-west-1.compute.internal
Name:			ip-10-30-2-251.us-west-1.compute.internal
Role:
Labels:			beta.kubernetes.io/arch=amd64
			beta.kubernetes.io/instance-type=m4.xlarge
			beta.kubernetes.io/os=linux
			failure-domain.beta.kubernetes.io/region=us-west-1
			failure-domain.beta.kubernetes.io/zone=us-west-1a
			kubernetes.io/hostname=osnode04.bdteam.local
			region=primary
			zone=west
Annotations:		volumes.kubernetes.io/controller-managed-attach-detach=true
Taints:			<none>
CreationTimestamp:	Fri, 06 Oct 2017 06:05:55 +0000
Phase:
Conditions:
  Type			Status	LastHeartbeatTime			LastTransitionTime			Reason				Message
  ----			------	-----------------			------------------			------				-------
  OutOfDisk 		False 	Fri, 06 Oct 2017 16:11:52 +0000 	Fri, 06 Oct 2017 06:05:55 +0000 	KubeletHasSufficientDisk 	kubelet has sufficient disk space available
  MemoryPressure 	False 	Fri, 06 Oct 2017 16:11:52 +0000 	Fri, 06 Oct 2017 06:05:55 +0000 	KubeletHasSufficientMemory 	kubelet has sufficient memory available
  DiskPressure 		False 	Fri, 06 Oct 2017 16:11:52 +0000 	Fri, 06 Oct 2017 06:05:55 +0000 	KubeletHasNoDiskPressure 	kubelet has no disk pressure
  Ready 		False 	Fri, 06 Oct 2017 16:11:52 +0000 	Fri, 06 Oct 2017 06:05:55 +0000 	KubeletNotReady 		runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
Addresses:		10.30.2.251,10.30.2.251,ip-10-30-2-251.bdteam.local,osnode04.bdteam.local
Capacity:
 cpu:		4
 memory:	16266720Ki
 pods:		40
Allocatable:
 cpu:		4
 memory:	16164320Ki
 pods:		40
System Info:
 Machine ID:			8bd05758fdfc1903174c9fcaf82b71ca
 System UUID:			EC2798A7-3C88-0538-2A95-D28F2BCCDF96
 Boot ID:			5d7f71a8-95f8-4ed6-a7ba-07977e2dc926
 Kernel Version:		3.10.0-693.2.2.el7.x86_64
 OS Image:			CentOS Linux 7 (Core)
 Operating System:		linux
 Architecture:			amd64
 Container Runtime Version:	docker://1.12.6
 Kubelet Version:		v1.6.1+5115d708d7
 Kube-Proxy Version:		v1.6.1+5115d708d7
ExternalID:			i-08ae279780695c5f7
Non-terminated Pods:		(0 in total)
  Namespace			Name		CPU Requests	CPU Limits	Memory Requests	Memory Limits
  ---------			----		------------	----------	---------------	-------------
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  CPU Requests	CPU Limits	Memory Requests	Memory Limits
  ------------	----------	---------------	-------------
  0 (0%)	0 (0%)		0 (0%)		0 (0%)
Events:
  FirstSeen	LastSeen	Count	From							SubObjectPath	Type		Reason			Message
  ---------	--------	-----	----							-------------	--------	------			-------
  1h		1h		1	kubelet, ip-10-30-2-251.us-west-1.compute.internal			Warning		ImageGCFailed		unable to find data for container /
  1h		1h		1	kubelet, ip-10-30-2-251.us-west-1.compute.internal			Normal		NodeHasSufficientDisk	Node ip-10-30-2-251.us-west-1.compute.internal status is now: NodeHasSufficientDisk
  1h		1h		1	kubelet, ip-10-30-2-251.us-west-1.compute.internal			Normal		NodeHasSufficientMemory	Node ip-10-30-2-251.us-west-1.compute.internal status is now: NodeHasSufficientMemory
  1h		1h		1	kubelet, ip-10-30-2-251.us-west-1.compute.internal			Normal		NodeHasNoDiskPressure	Node ip-10-30-2-251.us-west-1.compute.internal status is now: NodeHasNoDiskPressure
  1h		1h		1	kubelet, ip-10-30-2-251.us-west-1.compute.internal			Normal		Starting		Starting kubelet.

@stevekuznetsov
Contributor

@poonia0arun can you post your node log? The issue above seemed to be from CNI plugins not loading.

/cc @mrunalp can you link in people that could help with this?

@mrunalp
Member

mrunalp commented Oct 6, 2017

@danwinship @dcbw for the SDN startup issue.

@apooniajjn

apooniajjn commented Oct 6, 2017

@stevekuznetsov @mrunalp @danwinship and @dcbw, please find the requested logs and additional information below.

Software Versions:

  • oc v3.6.0+c4dd4cf
  • kubernetes v1.6.1+5115d708d7
  • features: Basic-Auth GSSAPI Kerberos SPNEGO

Ansible Hosts File

[OSEv3:children]
masters
nodes
etcd

[OSEv3:vars]
ansible_ssh_user=root
openshift_deployment_type=origin
openshift_master_cluster_method=native
openshift_master_cluster_hostname=osmasterelb.bdteam.local
openshift_master_cluster_public_hostname=osmasterelb.bdteam.local
openshift_clock_enabled=true
openshift_master_default_subdomain= apps.bdteam.local
openshift_cloudprovider_kind=aws
openshift_cloudprovider_aws_access_key=XXXXXXX
openshift_cloudprovider_aws_secret_key=XXXXXXXXX

# host group for masters
[masters]
osmaster01.bdteam.local openshift_hostname=osmaster01.bdteam.local
osmaster02.bdteam.local openshift_hostname=osmaster02.bdteam.local

[etcd]
openshift-etcd.bdteam.local openshift_hostname=openshift-etcd.bdteam.local

[nodes]
osmaster01.bdteam.local openshift_hostname=osmaster01.bdteam.local
osmaster02.bdteam.local openshift_hostname=osmaster02.bdteam.local
osnode01.bdteam.local openshift_node_labels="{'region': 'infra', 'zone': 'west'}"  openshift_hostname=osnode01.bdteam.local
osnode03.bdteam.local openshift_node_labels="{'region': 'infra', 'zone': 'west'}" openshift_hostname=osnode03.bdteam.local
osnode02.bdteam.local openshift_node_labels="{'region': 'primary', 'zone': 'west'}" openshift_hostname=osnode02.bdteam.local
osnode04.bdteam.local openshift_node_labels="{'region': 'primary', 'zone': 'west'}" openshift_hostname=osnode04.bdteam.local

The Ansible task fails while restarting the node service on each node/master

fatal: [osnode04.bdteam.local]: FAILED! => {
    "attempts": 3,
    "changed": false,
    "failed": true,
    "invocation": {
        "module_args": {
            "daemon_reload": false,
            "enabled": null,
            "masked": null,
            "name": "origin-node",
            "no_block": false,
            "state": "restarted",
            "user": false
        }
    },
    "msg": "Unable to restart service origin-node: Job for origin-node.service failed because the control process exited with error code. See \"systemctl status origin-node.service\" and \"journalctl -xe\" for details.\n"
}

RUNNING HANDLER [openshift_node : reload systemd units] ************************************************************************************************************************************************************************************************************************
META: ran handlers
	to retry, use: --limit @/root/openshift-ansible/playbooks/byo/config.retry

PLAY RECAP *********************************************************************************************************************************************************************************************************************************************************************
localhost                  : ok=21   changed=0    unreachable=0    failed=0
openshift-etcd.bdteam.local : ok=97   changed=34   unreachable=0    failed=0
osmaster01.bdteam.local    : ok=365  changed=110  unreachable=0    failed=1
osmaster02.bdteam.local    : ok=314  changed=95   unreachable=0    failed=1
osnode01.bdteam.local      : ok=146  changed=38   unreachable=0    failed=1
osnode02.bdteam.local      : ok=146  changed=38   unreachable=0    failed=1
osnode03.bdteam.local      : ok=146  changed=38   unreachable=0    failed=1
osnode04.bdteam.local      : ok=146  changed=38   unreachable=0    failed=1


INSTALLER STATUS ***************************************************************************************************************************************************************************************************************************************************************
Initialization             : Complete
etcd Install               : Complete
NFS Install                : Not Started
Load balancer Install      : Not Started
Master Install             : Complete
Master Additional Install  : Complete
Node Install               : In Progress
	This phase can be restarted by running: playbooks/byo/openshift-node/config.yml
GlusterFS Install          : Not Started
Hosted Install             : Not Started
Metrics Install            : Not Started
Logging Install            : Not Started
Service Catalog Install    : Not Started



Failure summary:


  1. Hosts:    osmaster01.bdteam.local, osmaster02.bdteam.local, osnode01.bdteam.local, osnode02.bdteam.local, osnode03.bdteam.local, osnode04.bdteam.local
     Play:     Configure nodes
     Task:     restart node
     Message:  Unable to restart service origin-node: Job for origin-node.service failed because the control process exited with error code. See "systemctl status origin-node.service" and "journalctl -xe" for details.

[root@osmaster01 ~]# packet_write_wait: Connection to 10.X.X.X port 22: Broken pipe

Output of oc get nodes

[root@osmaster01 centos]# oc get nodes
NAME                                        STATUS     AGE       VERSION
ip-10-30-1-200.us-west-1.compute.internal   NotReady   2h        v1.6.1+5115d708d7
ip-10-30-1-27.us-west-1.compute.internal    NotReady   2h        v1.6.1+5115d708d7
ip-10-30-1-43.us-west-1.compute.internal    NotReady   2h        v1.6.1+5115d708d7
ip-10-30-2-109.us-west-1.compute.internal   NotReady   2h        v1.6.1+5115d708d7
ip-10-30-2-182.us-west-1.compute.internal   NotReady   2h        v1.6.1+5115d708d7
ip-10-30-2-251.us-west-1.compute.internal   NotReady   2h        v1.6.1+5115d708d7

Output of kubectl describe node

[root@osmaster01 centos]# kubectl describe node ip-10-30-2-251.us-west-1.compute.internal
Name:			ip-10-30-2-251.us-west-1.compute.internal
Role:
Labels:			beta.kubernetes.io/arch=amd64
			beta.kubernetes.io/instance-type=m4.xlarge
			beta.kubernetes.io/os=linux
			failure-domain.beta.kubernetes.io/region=us-west-1
			failure-domain.beta.kubernetes.io/zone=us-west-1a
			kubernetes.io/hostname=osnode04.bdteam.local
			region=primary
			zone=west
Annotations:		volumes.kubernetes.io/controller-managed-attach-detach=true
Taints:			<none>
CreationTimestamp:	Fri, 06 Oct 2017 18:10:56 +0000
Phase:
Conditions:
  Type			Status	LastHeartbeatTime			LastTransitionTime			Reason				Message
  ----			------	-----------------			------------------			------				-------
  OutOfDisk 		False 	Fri, 06 Oct 2017 20:37:50 +0000 	Fri, 06 Oct 2017 18:10:56 +0000 	KubeletHasSufficientDisk 	kubelet has sufficient disk space available
  MemoryPressure 	False 	Fri, 06 Oct 2017 20:37:50 +0000 	Fri, 06 Oct 2017 18:10:56 +0000 	KubeletHasSufficientMemory 	kubelet has sufficient memory available
  DiskPressure 		False 	Fri, 06 Oct 2017 20:37:50 +0000 	Fri, 06 Oct 2017 18:10:56 +0000 	KubeletHasNoDiskPressure 	kubelet has no disk pressure
  Ready 		False 	Fri, 06 Oct 2017 20:37:50 +0000 	Fri, 06 Oct 2017 18:10:56 +0000 	KubeletNotReady 		runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
Addresses:		10.30.2.251,10.30.2.251,ip-10-30-2-251.bdteam.local,osnode04.bdteam.local
Capacity:
 cpu:		4
 memory:	16266720Ki
 pods:		40
Allocatable:
 cpu:		4
 memory:	16164320Ki
 pods:		40
System Info:
 Machine ID:			8bd05758fdfc1903174c9fcaf82b71ca
 System UUID:			EC2798A7-3C88-0538-2A95-D28F2BCCDF96
 Boot ID:			5d7f71a8-95f8-4ed6-a7ba-07977e2dc926
 Kernel Version:		3.10.0-693.2.2.el7.x86_64
 OS Image:			CentOS Linux 7 (Core)
 Operating System:		linux
 Architecture:			amd64
 Container Runtime Version:	docker://1.12.6
 Kubelet Version:		v1.6.1+5115d708d7
 Kube-Proxy Version:		v1.6.1+5115d708d7
ExternalID:			i-08ae279780695c5f7
Non-terminated Pods:		(0 in total)
  Namespace			Name		CPU Requests	CPU Limits	Memory Requests	Memory Limits
  ---------			----		------------	----------	---------------	-------------
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  CPU Requests	CPU Limits	Memory Requests	Memory Limits
  ------------	----------	---------------	-------------
  0 (0%)	0 (0%)		0 (0%)		0 (0%)
Events:
  FirstSeen	LastSeen	Count	From							SubObjectPath	Type		Reason			Message
  ---------	--------	-----	----							-------------	--------	------			-------
  1h		1h		1	kubelet, ip-10-30-2-251.us-west-1.compute.internal			Warning		ImageGCFailed		unable to find data for container /
  1h		1h		1	kubelet, ip-10-30-2-251.us-west-1.compute.internal			Normal		NodeHasSufficientDisk	Node ip-10-30-2-251.us-west-1.compute.internal status is now: NodeHasSufficientDisk
  1h		1h		1	kubelet, ip-10-30-2-251.us-west-1.compute.internal			Normal		NodeHasSufficientMemory	Node ip-10-30-2-251.us-west-1.compute.internal status is now: NodeHasSufficientMemory
  1h		1h		1	kubelet, ip-10-30-2-251.us-west-1.compute.internal			Normal		NodeHasNoDiskPressure	Node ip-10-30-2-251.us-west-1.compute.internal status is now: NodeHasNoDiskPressure
  1h		1h		1	kubelet, ip-10-30-2-251.us-west-1.compute.internal			Normal		Starting		Starting kubelet.

If I don't use any cloud provider in my Ansible config.yml file, the installation works fine, but I need to resolve this for AWS (or any other cloud provider).

systemctl status output for the node service on a particular node

[root@osnode01 centos]# systemctl status origin-node.service
● origin-node.service - OpenShift Node
   Loaded: loaded (/etc/systemd/system/origin-node.service; enabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/origin-node.service.d
           └─openshift-sdn-ovs.conf
   Active: activating (start) since Fri 2017-10-06 20:39:54 UTC; 8s ago
     Docs: https://github.com/openshift/origin
  Process: 56362 ExecStopPost=/usr/bin/dbus-send --system --dest=uk.org.thekelleys.dnsmasq /uk/org/thekelleys/dnsmasq uk.org.thekelleys.SetDomainServers array:string: (code=exited, status=0/SUCCESS)
  Process: 56360 ExecStopPost=/usr/bin/rm /etc/dnsmasq.d/node-dnsmasq.conf (code=exited, status=0/SUCCESS)
  Process: 56368 ExecStartPre=/usr/bin/dbus-send --system --dest=uk.org.thekelleys.dnsmasq /uk/org/thekelleys/dnsmasq uk.org.thekelleys.SetDomainServers array:string:/in-addr.arpa/127.0.0.1,/cluster.local/127.0.0.1 (code=exited, status=0/SUCCESS)
  Process: 56365 ExecStartPre=/usr/bin/cp /etc/origin/node/node-dnsmasq.conf /etc/dnsmasq.d/ (code=exited, status=0/SUCCESS)
 Main PID: 56370 (openshift)
   Memory: 42.2M
   CGroup: /system.slice/origin-node.service
           ├─56370 /usr/bin/openshift start node --config=/etc/origin/node/node-config.yaml --loglevel=2
           └─56415 journalctl -k -f

Oct 06 20:39:59 osnode01.bdteam.local origin-node[56370]: W1006 20:39:59.546363   56370 pod_container_deletor.go:77] Container "2f4c53551f7b6e654cc1de1159d44856f81b6d16f4ed5d1eb580c9cb3a9bc575" not found in pod's containers
Oct 06 20:39:59 osnode01.bdteam.local origin-node[56370]: W1006 20:39:59.546425   56370 pod_container_deletor.go:77] Container "851f6503d78acd135e3a4b87009d4163a808856f14757f6123c1cf625123504d" not found in pod's containers
Oct 06 20:39:59 osnode01.bdteam.local origin-node[56370]: W1006 20:39:59.546448   56370 pod_container_deletor.go:77] Container "88a45a9147f05a0bd9e05ed712069f10b4cea6c2af3ccd0eb1601166f3ccf679" not found in pod's containers
Oct 06 20:39:59 osnode01.bdteam.local origin-node[56370]: W1006 20:39:59.546460   56370 pod_container_deletor.go:77] Container "a3ef9c2922877e2f25bd4814fd1f4e371fd98a19ad36b54371fd0b1bc51e255b" not found in pod's containers
Oct 06 20:39:59 osnode01.bdteam.local origin-node[56370]: W1006 20:39:59.546472   56370 pod_container_deletor.go:77] Container "c5102f50c2e01a2100e1dcb025096967e31134c43ffdb1655827b908e5b29f77" not found in pod's containers
Oct 06 20:39:59 osnode01.bdteam.local origin-node[56370]: W1006 20:39:59.546483   56370 pod_container_deletor.go:77] Container "d68f9392b34c6410e6154c95febcfb55dac109725750ae5c20671c39279c9730" not found in pod's containers
Oct 06 20:39:59 osnode01.bdteam.local origin-node[56370]: W1006 20:39:59.546494   56370 pod_container_deletor.go:77] Container "eb04adc0b544c64e20ac3c847e03de048f7c7a26ce4d4a6b46282817d0df8e10" not found in pod's containers
Oct 06 20:39:59 osnode01.bdteam.local origin-node[56370]: W1006 20:39:59.710842   56370 cni.go:157] Unable to update cni config: No networks found in /etc/cni/net.d
Oct 06 20:39:59 osnode01.bdteam.local origin-node[56370]: E1006 20:39:59.710981   56370 kubelet.go:2072] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
Oct 06 20:40:00 osnode01.bdteam.local origin-node[56370]: W1006 20:40:00.816290   56370 sdn_controller.go:38] Could not find an allocated subnet for node: osnode01.bdteam.local, Waiting...
[root@osnode01 centos]#

Log output from one of the nodes (/var/log/messages)

Oct  6 20:41:15 osnode01 NetworkManager[18586]: <info>  [1507322475.5434] dhcp4 (eth0):   address 10.30.1.43
Oct  6 20:41:15 osnode01 NetworkManager[18586]: <info>  [1507322475.5434] dhcp4 (eth0):   plen 24 (255.255.255.0)
Oct  6 20:41:15 osnode01 NetworkManager[18586]: <info>  [1507322475.5434] dhcp4 (eth0):   gateway 10.30.1.1
Oct  6 20:41:15 osnode01 NetworkManager[18586]: <info>  [1507322475.5434] dhcp4 (eth0):   lease time 3600
Oct  6 20:41:15 osnode01 NetworkManager[18586]: <info>  [1507322475.5434] dhcp4 (eth0):   hostname 'ip-10-30-1-43'
Oct  6 20:41:15 osnode01 NetworkManager[18586]: <info>  [1507322475.5435] dhcp4 (eth0):   nameserver '10.21.0.251'
Oct  6 20:41:15 osnode01 NetworkManager[18586]: <info>  [1507322475.5435] dhcp4 (eth0):   domain name 'bdteam.local'
Oct  6 20:41:15 osnode01 NetworkManager[18586]: <info>  [1507322475.5435] dhcp4 (eth0): state changed bound -> bound
Oct  6 20:41:15 osnode01 dbus-daemon: dbus[632]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service'
Oct  6 20:41:15 osnode01 dbus[632]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service'
Oct  6 20:41:15 osnode01 systemd: Starting Network Manager Script Dispatcher Service...
Oct  6 20:41:15 osnode01 dhclient[18622]: bound to 10.30.1.43 -- renewal in 1686 seconds.
Oct  6 20:41:15 osnode01 dbus[632]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
Oct  6 20:41:15 osnode01 dbus-daemon: dbus[632]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
Oct  6 20:41:15 osnode01 systemd: Started Network Manager Script Dispatcher Service.
Oct  6 20:41:15 osnode01 nm-dispatcher: req:1 'dhcp4-change' [eth0]: new request (6 scripts)
Oct  6 20:41:15 osnode01 nm-dispatcher: req:1 'dhcp4-change' [eth0]: start running ordered scripts...
Oct  6 20:41:15 osnode01 nm-dispatcher: + cd /etc/sysconfig/network-scripts
Oct  6 20:41:15 osnode01 nm-dispatcher: + . ./network-functions
Oct  6 20:41:15 osnode01 nm-dispatcher: ++ PATH=/sbin:/usr/sbin:/bin:/usr/bin
Oct  6 20:41:15 osnode01 nm-dispatcher: ++ export PATH
Oct  6 20:41:15 osnode01 nm-dispatcher: +++ hostname
Oct  6 20:41:15 osnode01 nm-dispatcher: ++ HOSTNAME=osnode01.bdteam.local
Oct  6 20:41:15 osnode01 nm-dispatcher: ++ '[' -z '' ']'
Oct  6 20:41:15 osnode01 nm-dispatcher: ++ . /etc/init.d/functions
Oct  6 20:41:15 osnode01 nm-dispatcher: +++ TEXTDOMAIN=initscripts
Oct  6 20:41:15 osnode01 nm-dispatcher: +++ umask 022
Oct  6 20:41:15 osnode01 nm-dispatcher: +++ PATH=/sbin:/usr/sbin:/bin:/usr/bin
Oct  6 20:41:15 osnode01 nm-dispatcher: +++ export PATH
Oct  6 20:41:15 osnode01 nm-dispatcher: +++ '[' 56720 -ne 1 -a -z '' ']'
Oct  6 20:41:15 osnode01 nm-dispatcher: +++ '[' -d /run/systemd/system ']'
Oct  6 20:41:15 osnode01 nm-dispatcher: +++ case "$0" in
Oct  6 20:41:15 osnode01 nm-dispatcher: +++ '[' -z '' ']'
Oct  6 20:41:15 osnode01 nm-dispatcher: +++ COLUMNS=80
Oct  6 20:41:15 osnode01 nm-dispatcher: +++ '[' -z '' ']'
Oct  6 20:41:15 osnode01 nm-dispatcher: +++ '[' -c /dev/stderr -a -r /dev/stderr ']'
Oct  6 20:41:15 osnode01 nm-dispatcher: +++ CONSOLETYPE=serial
Oct  6 20:41:15 osnode01 nm-dispatcher: +++ '[' -z '' ']'
Oct  6 20:41:15 osnode01 nm-dispatcher: +++ '[' -z '' ']'
Oct  6 20:41:15 osnode01 nm-dispatcher: +++ '[' -f /etc/sysconfig/i18n -o -f /etc/locale.conf ']'
Oct  6 20:41:15 osnode01 nm-dispatcher: +++ . /etc/profile.d/lang.sh
Oct  6 20:41:15 osnode01 nm-dispatcher: +++ unset LANGSH_SOURCED
Oct  6 20:41:15 osnode01 nm-dispatcher: +++ '[' -z '' ']'
Oct  6 20:41:15 osnode01 nm-dispatcher: +++ '[' -f /etc/sysconfig/init ']'
Oct  6 20:41:15 osnode01 nm-dispatcher: +++ . /etc/sysconfig/init
Oct  6 20:41:15 osnode01 nm-dispatcher: ++++ BOOTUP=color
Oct  6 20:41:15 osnode01 nm-dispatcher: ++++ RES_COL=60
Oct  6 20:41:15 osnode01 nm-dispatcher: ++++ MOVE_TO_COL='echo -en \033[60G'
Oct  6 20:41:15 osnode01 nm-dispatcher: ++++ SETCOLOR_SUCCESS='echo -en \033[0;32m'
Oct  6 20:41:15 osnode01 nm-dispatcher: ++++ SETCOLOR_FAILURE='echo -en \033[0;31m'
Oct  6 20:41:15 osnode01 nm-dispatcher: ++++ SETCOLOR_WARNING='echo -en \033[0;33m'
Oct  6 20:41:15 osnode01 nm-dispatcher: ++++ SETCOLOR_NORMAL='echo -en \033[0;39m'
Oct  6 20:41:15 osnode01 nm-dispatcher: +++ '[' serial = serial ']'
Oct  6 20:41:15 osnode01 nm-dispatcher: +++ BOOTUP=serial
Oct  6 20:41:15 osnode01 nm-dispatcher: +++ MOVE_TO_COL=
Oct  6 20:41:15 osnode01 nm-dispatcher: +++ SETCOLOR_SUCCESS=
Oct  6 20:41:15 osnode01 nm-dispatcher: +++ SETCOLOR_FAILURE=
Oct  6 20:41:15 osnode01 nm-dispatcher: +++ SETCOLOR_WARNING=
Oct  6 20:41:15 osnode01 nm-dispatcher: +++ SETCOLOR_NORMAL=
Oct  6 20:41:15 osnode01 nm-dispatcher: +++ __sed_discard_ignored_files='/\(~\|\.bak\|\.orig\|\.rpmnew\|\.rpmorig\|\.rpmsave\)$/d'
Oct  6 20:41:15 osnode01 nm-dispatcher: +++ '[' '' = 1 ']'
Oct  6 20:41:15 osnode01 nm-dispatcher: ++++ cat /proc/cmdline
Oct  6 20:41:15 osnode01 nm-dispatcher: +++ strstr 'BOOT_IMAGE=/boot/vmlinuz-3.10.0-693.2.2.el7.x86_64 root=UUID=29342a0b-e20f-4676-9ecf-dfdf02ef6683 ro console=tty0 console=ttyS0,115200n8 crashkernel=auto console=ttyS0,115200 LANG=en_US.UTF-8' rc.debug
Oct  6 20:41:15 osnode01 nm-dispatcher: +++ '[' 'BOOT_IMAGE=/boot/vmlinuz-3.10.0-693.2.2.el7.x86_64 root=UUID=29342a0b-e20f-4676-9ecf-dfdf02ef6683 ro console=tty0 console=ttyS0,115200n8 crashkernel=auto console=ttyS0,115200 LANG=en_US.UTF-8' = 'BOOT_IMAGE=/boot/vmlinuz-3.10.0-693.2.2.el7.x86_64 root=UUID=29342a0b-e20f-4676-9ecf-dfdf02ef6683 ro console=tty0 console=ttyS0,115200n8 crashkernel=auto console=ttyS0,115200 LANG=en_US.UTF-8' ']'
Oct  6 20:41:15 osnode01 nm-dispatcher: +++ return 1
Oct  6 20:41:15 osnode01 nm-dispatcher: +++ return 0
Oct  6 20:41:15 osnode01 nm-dispatcher: + '[' -f ../network ']'
Oct  6 20:41:15 osnode01 nm-dispatcher: + . ../network
Oct  6 20:41:15 osnode01 nm-dispatcher: ++ NETWORKING=yes
Oct  6 20:41:15 osnode01 nm-dispatcher: ++ NOZEROCONF=yes
Oct  6 20:41:15 osnode01 nm-dispatcher: + [[ dhcp4-change =~ ^(up|dhcp4-change|dhcp6-change)$ ]]
Oct  6 20:41:15 osnode01 nm-dispatcher: + NEEDS_RESTART=0
Oct  6 20:41:15 osnode01 nm-dispatcher: + UPSTREAM_DNS=/etc/dnsmasq.d/origin-upstream-dns.conf
Oct  6 20:41:15 osnode01 nm-dispatcher: ++ mktemp
Oct  6 20:41:15 osnode01 nm-dispatcher: + UPSTREAM_DNS_TMP=/tmp/tmp.5DzdaQo1tn
Oct  6 20:41:15 osnode01 nm-dispatcher: ++ mktemp
Oct  6 20:41:15 osnode01 nm-dispatcher: + UPSTREAM_DNS_TMP_SORTED=/tmp/tmp.Ie4FFsjAgL
Oct  6 20:41:15 osnode01 nm-dispatcher: ++ mktemp
Oct  6 20:41:15 osnode01 nm-dispatcher: + CURRENT_UPSTREAM_DNS_SORTED=/tmp/tmp.0ZlG7MgcgO
Oct  6 20:41:15 osnode01 nm-dispatcher: ++ mktemp
Oct  6 20:41:15 osnode01 nm-dispatcher: + NEW_RESOLV_CONF=/tmp/tmp.293w7YIsqD
Oct  6 20:41:15 osnode01 nm-dispatcher: ++ mktemp
Oct  6 20:41:15 osnode01 nm-dispatcher: + NEW_NODE_RESOLV_CONF=/tmp/tmp.D9exxlKVYt
Oct  6 20:41:15 osnode01 nm-dispatcher: ++ /sbin/ip route list match 0.0.0.0/0
Oct  6 20:41:15 osnode01 nm-dispatcher: ++ awk '{print $3 }'
Oct  6 20:41:15 osnode01 nm-dispatcher: + def_route=10.30.1.1
Oct  6 20:41:15 osnode01 nm-dispatcher: ++ /sbin/ip route get to 10.30.1.1
Oct  6 20:41:15 osnode01 nm-dispatcher: ++ awk '{print $3}'
Oct  6 20:41:15 osnode01 nm-dispatcher: + def_route_int=eth0
Oct  6 20:41:15 osnode01 nm-dispatcher: ++ /sbin/ip route get to 10.30.1.1
Oct  6 20:41:15 osnode01 nm-dispatcher: ++ awk '{print $5}'
Oct  6 20:41:15 osnode01 nm-dispatcher: + def_route_ip=10.30.1.43
Oct  6 20:41:15 osnode01 nm-dispatcher: + [[ eth0 == eth0 ]]
Oct  6 20:41:15 osnode01 nm-dispatcher: + '[' '!' -f /etc/dnsmasq.d/origin-dns.conf ']'
Oct  6 20:41:15 osnode01 nm-dispatcher: + grep -q 99-origin-dns.sh /etc/resolv.conf
Oct  6 20:41:15 osnode01 nm-dispatcher: ++ systemctl -q is-active dnsmasq.service
Oct  6 20:41:15 osnode01 nm-dispatcher: + '[' 0 -eq 1 ']'
Oct  6 20:41:15 osnode01 nm-dispatcher: ++ systemctl -q is-active dnsmasq.service
Oct  6 20:41:15 osnode01 nm-dispatcher: + grep -q 99-origin-dns.sh /etc/resolv.conf
Oct  6 20:41:15 osnode01 nm-dispatcher: + sed -e '/^nameserver.*$/d' /etc/resolv.conf
Oct  6 20:41:15 osnode01 nm-dispatcher: + echo 'nameserver 10.30.1.43'
Oct  6 20:41:15 osnode01 nm-dispatcher: + grep -q 'search.*cluster.local' /tmp/tmp.293w7YIsqD
Oct  6 20:41:15 osnode01 nm-dispatcher: + grep -qw search /tmp/tmp.293w7YIsqD
Oct  6 20:41:15 osnode01 nm-dispatcher: + cp -Z /tmp/tmp.293w7YIsqD /etc/resolv.conf
Oct  6 20:41:15 osnode01 nm-dispatcher: + rm -f /tmp/tmp.5DzdaQo1tn /tmp/tmp.Ie4FFsjAgL /tmp/tmp.0ZlG7MgcgO /tmp/tmp.293w7YIsqD
Oct  6 20:41:18 osnode01 origin-node: I1006 20:41:18.210035   56657 aws.go:936] Could not determine public DNS from AWS metadata.
Oct  6 20:41:18 osnode01 origin-node: W1006 20:41:18.246426   56657 cni.go:157] Unable to update cni config: No networks found in /etc/cni/net.d
Oct  6 20:41:18 osnode01 origin-node: E1006 20:41:18.246581   56657 kubelet.go:2072] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
Oct  6 20:41:20 osnode01 origin-node: W1006 20:41:20.737092   56657 sdn_controller.go:38] Could not find an allocated subnet for node: osnode01.bdteam.local, Waiting...
Oct  6 20:41:20 osnode01 origin-node: F1006 20:41:20.737146   56657 node.go:309] error: SDN node startup failed: failed to get subnet for this host: osnode01.bdteam.local, error: timed out waiting for the condition
Oct  6 20:41:20 osnode01 systemd: origin-node.service: main process exited, code=exited, status=255/n/a
Oct  6 20:41:20 osnode01 dnsmasq[18837]: setting upstream servers from DBus
Oct  6 20:41:20 osnode01 dnsmasq[18837]: using nameserver 10.21.0.251#53
Oct  6 20:41:20 osnode01 dbus-daemon: dbus[632]: [system] Rejected send message, 0 matched rules; type="method_return", sender=":1.7943" (uid=0 pid=18837 comm="/usr/sbin/dnsmasq -k ") interface="(unset)" member="(unset)" error name="(unset)" requested_reply="0" destination=":1.9458" (uid=0 pid=56795 comm="/usr/bin/dbus-send --system --dest=uk.org.thekelle")
Oct  6 20:41:20 osnode01 dbus[632]: [system] Rejected send message, 0 matched rules; type="method_return", sender=":1.7943" (uid=0 pid=18837 comm="/usr/sbin/dnsmasq -k ") interface="(unset)" member="(unset)" error name="(unset)" requested_reply="0" destination=":1.9458" (uid=0 pid=56795 comm="/usr/bin/dbus-send --system --dest=uk.org.thekelle")
Oct  6 20:41:20 osnode01 systemd: Failed to start OpenShift Node.
Oct  6 20:41:20 osnode01 systemd: Unit origin-node.service entered failed state.
Oct  6 20:41:20 osnode01 systemd: origin-node.service failed.

@danwinship
Contributor

This looks like the bug that was fixed by #15865. The SDN code assumed that kubelet registration would happen "quickly", and would bail out with an error if it didn't, but apparently sometimes it takes a long time.
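
For anyone who wants to confirm they're hitting this race: the thing the node-side SDN is waiting on is the HostSubnet the master allocates after the node registers. A rough check (a sketch only, assuming the usual OpenShift 3.x CLI and the origin-node unit name from the logs above) looks like:

# On a master: the node should appear here once the kubelet has registered.
oc get nodes

# Also on a master: the SDN allocates one HostSubnet per registered node.
# If the failing node has no entry here, the node-side SDN eventually dies with
# "failed to get subnet for this host ... timed out waiting for the condition".
oc get hostsubnets

# On the failing node: the tail end of the last startup attempt.
journalctl -u origin-node.service --no-pager | tail -n 50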

Workaround: bring the node up for the first time with networkConfig.networkPluginName set to "". Wait for it to register itself with the master (i.e., wait until oc get nodes shows that node). Then kill it, set the plugin name back to what it was before, and start OpenShift again. It should start fine (and should keep starting fine in the future; the problem only happens when bringing up a new node for the first time).
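
Concretely, on the affected node the workaround looks roughly like the following. This is only a sketch: it assumes the stock config path /etc/origin/node/node-config.yaml (the one shown in the systemd unit above) and that the original plugin was the default redhat/openshift-ovs-subnet; substitute whatever networkPluginName your node-config.yaml actually had.

# 1. Blank out the network plugin and start the node once without SDN.
sed -i 's|networkPluginName:.*|networkPluginName: ""|' /etc/origin/node/node-config.yaml
systemctl restart origin-node

# 2. From a master, wait until the node registers itself.
oc get nodes -w

# 3. Put the original plugin name back and restart; later starts should be fine.
sed -i 's|networkPluginName: ""|networkPluginName: redhat/openshift-ovs-subnet|' /etc/origin/node/node-config.yaml
systemctl restart origin-node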

@danwinship
Contributor

Oh, looking more carefully, @poonia0arun's bug is not necessarily the same as the original bug report.

The original bug report is fixed by #15865.

The node registers itself with the master, but the node service won't start.
...
If I don't use any cloud provider in my Ansible config.yml file, the installation works fine, but I need to resolve this for AWS (or any other cloud provider).

That could be a problem with public vs private IP addresses. Look at openshift_hostname/openshift_public_hostname and openshift_ip/openshift_public_ip in the advanced install guide. The nodes need to know each other's cloud-internal IPs, not just the internet-facing IPs.
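
For example, on AWS that usually means giving each node entry its cloud-internal identity explicitly. One possible shape (illustrative only, using the hostname and private IP visible in the describe output above, and assuming the cloud-internal EC2 name is what the cluster should use internally) would be:

[nodes]
osnode04.bdteam.local openshift_hostname=ip-10-30-2-251.us-west-1.compute.internal openshift_ip=10.30.2.251 openshift_public_hostname=osnode04.bdteam.local openshift_node_labels="{'region': 'primary', 'zone': 'west'}"

Note that the inventory above pins openshift_hostname to the bdteam.local names while oc get nodes shows the EC2-internal names, so the two identities may be getting mixed.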

@apooniajjn

@danwinship that's true. I am not using any SDN plugin specifically; I believe the Ansible installation uses a default SDN, and as you can see from the error, it can't initialize CNI, which results in the failure of the OpenShift Origin installation.

@gabemontero
Contributor Author

I'm good with closing the issue then, as fixed by #15865.
