Dead plugin+systemd = dead docker #813
Comments
Similar report at moby/moby#17960 (comment), but user there is not using a plugin. |
I've seen this problem too while developing a new network plugin. I was on Debian 8 with systemd. The only way to recover from this was to restart my plugin, then restart docker 😭
But what should happen to those containers if the plugin is still unreachable? |
This implies one must never attempt to implement a plugin in a container. If that is a recommendation, it should be in the docs.
Leave them unstarted. |
Not sure it's a limitation - I haven't tried it with containers yet.
Works for me /cc @mrjana @mavenugo @aboch |
If the user has Also consider the case that the driver is crashing before handling any commands. |
@bboreham @dave-tucker there are many such cases, and I have asked @calavera for a discussion to get his input on the Docker plugin infra. To me, this issue is a manifestation of the way we treat plugin containers. Instead of trying to band-aid this scenario, let's try to address it at an infra level. |
Thanks @mavenugo |
@mavenugo there was no intention of band-aiding on my side.
Please note the problem I reported at the top will happen just the same if your plugin is not in a container. |
@bboreham fair enough. But, one must note that |
OK, I see where you're coming from. To me, having no way to reconfigure this aspect of the engine when the engine is down is the fundamental problem. I.e. you can get in to this state via some |
Yes. Any such deadlock scenario must be avoided and is certainly a topic of discussion. |
Inspection of the code suggests the failure is happening on this code path:
which all happens strictly before the call to So it is impossible to have Docker survive a restart, when a plugin driver is implemented inside a container. |
To refine the problem, |
And, |
@bboreham the only reason we do sandboxCleanup is to clean up any stale endpoints left behind by an ungraceful daemon restart/shutdown. If the shutdown is graceful, this will never happen. Having said that, an ungraceful restart can happen, and we should do our best to clean up the state. I think the right solution is to not load the plugin when cleaning up endpoint state during sandboxCleanup. Fortunately, I think I have a workaround for this. Let me try a simpler solution to see if we can get this resolved. |
I do |
@bboreham an ungraceful restart of the Docker daemon can happen in multiple ways. Essentially, this can happen whenever the network sandbox is not cleaned up (which is possible only after an ungraceful daemon shutdown). PTAL |
But why is docker trying to initialize a plugin if it's not been specifically requested yet? Or better said, why is Docker requesting a plugin before a container is even attempted to be started that needs it? |
@cpuguy83 this happens because libnetwork tries to clean up any stale endpoint resources left behind by an ungraceful restart of the daemon. If we don't perform the cleanup, these resources (such as IP addresses) stay held and cannot be reused after the restart. For this cleanup we reuse the same code path, and hence it tries to access the driver. But this is a forced cleanup, and we don't have to invoke the plugin for it. To retain the same code path, it would be good to get #19390; if we don't have that, then it has to be managed entirely in libnetwork. |
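The forced-cleanup idea in the comment above can be illustrated with a minimal Go sketch. This is not libnetwork's code: the `Driver` interface, the `store` type, and the `deleteEndpoint` method are hypothetical names invented for illustration. The point is only that a forced cleanup can release locally held resources (here, an IP address) without calling a driver that may be unreachable.

```go
package main

import (
	"errors"
	"fmt"
)

// Driver is a hypothetical stand-in for a network driver that may be
// backed by an unreachable remote plugin.
type Driver interface {
	DeleteEndpoint(id string) error
}

// endpoint is a hypothetical record of a stale endpoint and the IP it holds.
type endpoint struct {
	id string
	ip string
}

// store is a hypothetical local view of endpoint state.
type store struct {
	driver Driver
	inUse  map[string]bool // IP addresses currently held
}

// deleteEndpoint releases the endpoint's IP. When force is true the driver
// is not consulted, so an unreachable plugin cannot block the cleanup.
func (s *store) deleteEndpoint(ep endpoint, force bool) error {
	if !force {
		if err := s.driver.DeleteEndpoint(ep.id); err != nil {
			return err
		}
	}
	delete(s.inUse, ep.ip)
	return nil
}

// deadDriver simulates a plugin that cannot be reached.
type deadDriver struct{}

func (deadDriver) DeleteEndpoint(string) error {
	return errors.New("plugin unreachable")
}

func main() {
	s := &store{driver: deadDriver{}, inUse: map[string]bool{"10.0.0.2": true}}
	ep := endpoint{id: "ep1", ip: "10.0.0.2"}

	fmt.Println("normal delete error:", s.deleteEndpoint(ep, false))
	fmt.Println("forced delete error:", s.deleteEndpoint(ep, true))
	fmt.Println("IP still held:", s.inUse[ep.ip])
}
```

The trade-off in this sketch is that the plugin never hears about the deletion, which is acceptable only in the stale-state case described in the comment.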
It's just an incidental use of the function, while iterating through all local-scope endpoints. I'll try to make a PR to take it out. |
@bboreham it is not an incidental use and this should not be removed. We could definitely bypass the call to the driver during this force cleanup scenario. But the real problem is none of these. The problem is that plugins.Get is lazy and greedy and it automatically performs retry. Maybe we need a simple API in plugins pkg to return just the available plugins (without trying to load it). |
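As a rough illustration of the distinction drawn above between a lazy, loading lookup and an API that returns only plugins that are already available, here is a small Go sketch. It does not reproduce the real `pkg/plugins` package: `registry`, `get`, `lookupLoaded`, and `newRegistry` are hypothetical names, and the sketch ignores concurrency and retries.

```go
package main

import (
	"errors"
	"fmt"
)

// plugin is a hypothetical handle to a loaded plugin.
type plugin struct{ name string }

// registry is a hypothetical plugin registry (not safe for concurrent use).
type registry struct {
	loaded    map[string]*plugin
	load      func(name string) (*plugin, error) // may be slow or fail
	loadCalls int                                 // counts load attempts
}

func newRegistry(load func(string) (*plugin, error)) *registry {
	return &registry{loaded: map[string]*plugin{}, load: load}
}

// get is lazy: if the plugin is not loaded yet, it tries to load it,
// which is where a caller can end up waiting on a dead plugin.
func (r *registry) get(name string) (*plugin, error) {
	if p, ok := r.loaded[name]; ok {
		return p, nil
	}
	r.loadCalls++
	p, err := r.load(name)
	if err != nil {
		return nil, err
	}
	r.loaded[name] = p
	return p, nil
}

// lookupLoaded returns the plugin only if it is already loaded and
// never triggers a load attempt.
func (r *registry) lookupLoaded(name string) (*plugin, bool) {
	p, ok := r.loaded[name]
	return p, ok
}

func main() {
	r := newRegistry(func(name string) (*plugin, error) {
		return nil, errors.New("plugin unreachable: " + name)
	})
	if _, ok := r.lookupLoaded("mynet"); !ok {
		fmt.Println("mynet not loaded, skipping driver call; load attempts:", r.loadCalls)
	}
	if _, err := r.get("mynet"); err != nil {
		fmt.Println("lazy get failed:", err, "; load attempts:", r.loadCalls)
	}
}
```

A cleanup path that uses the non-loading lookup can decide to skip the driver call when the plugin is absent, instead of waiting on it.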
@mavenugo https://github.com/docker/docker/blob/master/pkg/plugins/plugins.go#L191 |
@cpuguy83 yes. I think I should also add a |
@cpuguy83 actually, i could use the |
OK, I tried my idea of just fixing the I tried your #880; it crashed. But I can no longer see your comment asking me to try it, so maybe you don't expect it to work yet. While doing all this, I noticed there is a similar problem during shutdown: if Docker shuts down the container running the plugin, then shuts down another container attached to that network, the call to |
@bboreham I tried a valid remote plugin manually and I hit the panic that you encountered. Have resolved that as well. Will do more validations and push the final fix. |
Thanks; I will test that later today. I raised the shutdown problem as a separate issue - #882 |
- Fixes moby#19404
- Fixes ungraceful daemon restart issue in systemd with remote network plugin (moby/libnetwork#813)
Signed-off-by: Madhu Venugopal <madhu@docker.com>
@mavenugo @bboreham can this be closed now that moby/moby#19465 was merged? |
Yes. |
Yes, thanks. |
I am developing a Docker Network plugin, so from time to time it crashes and/or I stop it running. If, during this time, I restart the host, Docker will often get completely hosed.
It seems this is a result of Docker waiting 15 seconds per endpoint to try to talk to my plugin and systemd only waiting a limited time for Docker to get started. Also it appears to be timing out three times for the same endpoint (at least in this example):
Docker should provide some way for installations to recover from this situation, maybe a --force-network-cleanup flag? Or do the retrying in the background after getting the main daemon up and running.