This repository has been archived by the owner on Nov 29, 2017. It is now read-only.

Sometimes the "beam" processes can't be stopped completely when stopping the rabbit_node_ng job #170

Open
gu-bin opened this issue Mar 30, 2016 · 1 comment
gu-bin commented Mar 30, 2016

When running "bosh deploy" to upgrade the rabbit_node_ng job, bosh first stops all the jobs on the VM and then unmounts the persistent disk. Sometimes, however, stopping the jobs does not stop the "beam" processes (created by the rabbit_node job), so the persistent disk is still in use by "beam" and cannot be unmounted, which causes the bosh deploy to fail. To recover, we have to log in to the rabbit_node_ng VM, kill all the beam processes, and re-run bosh deploy. Because of this problem, we can't run bosh deploy fully automatically without manual intervention.

We investigated and found that the problem is caused by some warden processes (the parent processes of the beam processes) that are not stopped by stopping the rabbit_node job (/var/vcap/jobs/rabbit_node_ng/bin/rabbit_node_ctl stop). Normally, when bosh stops the rabbit_node job, the warden processes (such as wshd: 19gdipma38k) are killed, and the beam processes are killed along with them. But sometimes the warden processes fail to be killed, so the beam processes stay alive and keep the persistent disk in use.

One possible fix is to extend the "/var/vcap/jobs/rabbit_node_ng/bin/rabbit_node_ctl stop" command so that, after "kill_and_wait $PIDFILE 60", it checks whether any warden processes are still alive and, if so, kills them.
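A rough sketch of what that cleanup step could look like. This is a hypothetical helper, not the actual ctl script: the function name, the `wshd:` and `beam` patterns, and the SIGTERM-then-SIGKILL timing are assumptions based on the process names mentioned above.

```shell
#!/bin/bash
# Hypothetical cleanup helper for rabbit_node_ctl's stop case.
# Kills any processes whose command line matches $1 and that
# survived the normal kill_and_wait shutdown.

kill_leftover_processes() {
  local pattern="$1"
  local pids

  # Find survivors by command-line pattern (pgrep exits non-zero
  # when nothing matches, so swallow that with || true).
  pids=$(pgrep -f "$pattern" || true)
  [ -z "$pids" ] && return 0

  echo "Leftover processes matching '$pattern': $pids; sending SIGTERM"
  kill $pids 2>/dev/null || true
  sleep 5

  # Escalate to SIGKILL for anything still alive.
  pids=$(pgrep -f "$pattern" || true)
  if [ -n "$pids" ]; then
    echo "Escalating to SIGKILL for: $pids"
    kill -9 $pids 2>/dev/null || true
  fi
}

# In the ctl script's stop case, after `kill_and_wait $PIDFILE 60`,
# something like:
#   kill_leftover_processes 'wshd:'
#   kill_leftover_processes 'beam'
```

With the leftover warden (and beam) processes gone, nothing should hold the persistent disk open, and the unmount during `bosh deploy` should succeed without manual intervention.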

gu-bin commented Apr 22, 2016

Is there any update?
/cc @maximilien

@maximilien added the bug label on Apr 23, 2016