[FVT] rinstall will hung when it failed to terminate sub process with SIGKILL #3094

Closed
hu-weihua opened this issue May 18, 2017 · 6 comments

@hu-weihua

xcat version:
Version 2.13.4 (git commit 6ee3741, built Tue May 16 10:03:13 EDT 2017)

rinstall hangs when it fails to terminate a subprocess with SIGKILL.

RUN:rinstall c910f03c11k11 osimage=sles12.2-ppc64le-install-compute  [Wed May 17 09:32:50 2017]

c910f03c11k09:~ # ps axjf |grep rinstall
14254 14909 14228 14228 ?           -1 S        0   0:09  |           \_ perl /opt/xcat/bin/rinstall c910f03c11k11 osimage=sles12.2-ppc64le-install-compute
28460 28662 28661 28460 pts/3    28661 S+       0   0:00          \_ grep --color=auto rinstall
14149 14914 14149 14149 ?           -1 S        0   0:18  \_ xcatd SSL: rinstall to c910f03c11k11 for root@l
14914 14915 14915 14149 ?           -1 S        0   0:00      \_ xcatd SSL: rinstall to c910f03c11k11 for root@l
14915 14968 14915 14149 ?           -1 S        0   0:00          \_ xcatd SSL: rinstall to c910f03c11k11 for root@l


c910f03c11k09:~ # strace -p  14968
Process 14968 attached
waitpid(-1, ^CProcess 14968 detached
 <detached ...>


c910f03c11k09:~ # cat /var/log/xcat/cluster.log
.........
May 17 09:32:50 c910f03c11k09 xcat[14914]:  xCAT: Allowing rinstall to c910f03c11k11 osimage=sles12.2-ppc64le-install-compute for root from localhost
May 17 09:33:07 c910f03c11k09 xcat[14974]:  xCAT: Allowing getcons to c910f03c11k11 text for root from localhost
May 17 09:33:10 c910f03c11k09 xcat[14968]:  xcatd: kvm plugin bug, pid 14968, process description: 'xcatd SSL: rinstall to c910f03c11k11 for root@localhost: rinstall instance: kvm instance' with error 'libvirt error code: 38, message: Failed to terminate process 99106 with SIGKILL: Device or resource busy#012' while trying to fulfill request for the following nodes: c910f03c11k11
 .......

rinstall should report this error and return; it should not hang.
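
For illustration only: the strace output above shows the worker blocked in waitpid(-1, ...) with no timeout. The sketch below shows one general way a parent can bound that wait, by polling with WNOHANG until a deadline. It is written in Python rather than xCAT's Perl, the function name `wait_for_children` and its parameters are hypothetical, and it is not the change that was made in xCAT.

```python
import os
import time


def wait_for_children(deadline_seconds, poll_interval=0.05):
    """Reap child processes until none remain or the deadline passes.

    Returns (reaped, timed_out), where reaped maps pid -> raw wait status.
    Hypothetical helper for illustration; not part of xCAT.
    """
    reaped = {}
    deadline = time.monotonic() + deadline_seconds
    while True:
        try:
            # WNOHANG makes waitpid return (0, 0) immediately when no
            # child has changed state yet, instead of blocking.
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            # No children left to wait for.
            return reaped, False
        if pid != 0:
            reaped[pid] = status
            continue
        if time.monotonic() >= deadline:
            # Give up so the caller can report an error and return.
            return reaped, True
        time.sleep(poll_interval)
```

When `timed_out` comes back true, the caller can log which children are still running and return a non-zero exit code instead of waiting indefinitely.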

@immarvin
Contributor

immarvin commented Jun 2, 2017

I cannot reproduce this in my environment; it needs more investigation.

@robin2008
Member

@hu-weihua Do we have concrete steps to reproduce this? Or was this issue observed in our daily testing during 2.13.5?

@zet809 zet809 modified the milestones: 2.13.6, 2.13.5 Jul 4, 2017
@zet809 zet809 removed this from the 2.13.6 milestone Jul 18, 2017
@hu-weihua
Author

@immarvin and @robin2008, this is a timing issue. If rinstall executes every step cleanly, everything is fine. But if it fails to terminate the subprocess, rinstall hangs forever. We have hit this problem many times in the automation environment, so I do not think it is a low-probability event.
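
As a rough illustration of the "report the error and return" behaviour requested in this issue, the sketch below runs a task in a forked child that always exits, even when the task raises, and passes the error text back to the parent over a pipe. It is Python rather than xCAT's Perl, `run_in_child` and `task` are hypothetical names, and nothing here is taken from the xCAT code base.

```python
import os


def run_in_child(task):
    """Run task() in a forked child and return (exit_code, error_text).

    Hypothetical helper for illustration; not part of xCAT.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: report any exception to the parent and always exit.
        os.close(read_fd)
        code = 1
        try:
            task()
            code = 0
        except Exception as exc:
            os.write(write_fd, str(exc).encode())
        finally:
            # _exit closes the pipe, which unblocks the parent's read.
            os._exit(code)
    # Parent: read until the child closes its end of the pipe.
    os.close(write_fd)
    chunks = []
    while True:
        chunk = os.read(read_fd, 4096)
        if not chunk:
            break
        chunks.append(chunk)
    os.close(read_fd)
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status), b"".join(chunks).decode()
    # Child was killed by a signal; report it as a negative code.
    return -os.WTERMSIG(status), b"".join(chunks).decode()
```

A caller can then print the returned error text and exit with the returned code, so a failure inside the child surfaces as an error message rather than as a wait that never ends.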

immarvin added a commit to immarvin/xcat-core that referenced this issue Jul 19, 2017
…#3466;fix issue [FVT] rinstall will hung when it failed to terminate sub process with SIGKILL xcat2#3094
@immarvin
Contributor

fixed in #3507

@immarvin
Contributor

Since this is a blocking issue in the daily autotest, I am changing the priority to "high".

@immarvin immarvin added this to the 2.13.6 milestone Jul 19, 2017
@immarvin immarvin added sprint2 and removed sprint1 labels Jul 19, 2017
robin2008 pushed a commit that referenced this issue Jul 19, 2017
…fix issue [FVT] rinstall will hung when it failed to terminate sub process with SIGKILL #3094 (#3507)
@immarvin
Contributor

The verification is covered by daily autotest:

RUN:rinstall c910f03c11k13 osimage=sles12.2-ppc64le-install-service [Wed Jul 19 11:55:29 2017]
ElapsedTime:5 sec
RETURN rc = 1
OUTPUT:
Error: Failed to run 'rpower' against the following nodes: c910f03c11k13
Provision node(s): c910f03c11k13
c910f03c11k13: internal error: qemu unexpectedly closed the monitor: 2017-07-19T15:55:34.608881Z qemu-kvm: Failed to allocate KVM HPT of order 26 (try smaller maxmem?): Cannot allocate memory
CHECK:rc == 0   [Failed]

The internal libvirt exception is now captured and reported by rinstall.

Closing this ticket.
