-
Notifications
You must be signed in to change notification settings - Fork 459
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Stop existing modules when iotedged starts #3299
Conversation
It would be worth making it clear (in both the commit message and the code comment) that this situation happens specifically when the socket is created by iotedged rather than using systemd socket activation. With systemd socket activation the socket file is reused when iotedged restarts, but without socket activation iotedged has to bind it itself, which means it has to unlink it first, which means it gets disconnected from the modules. |
Updated, thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am curious about some corner cases. Thanks
// for the workload and management APIs. On some platforms (e.g. CentOS 7.5), calls | ||
// to these APIs will begin to fail. Resilient modules should be able to deal with | ||
// this, but we'll restart all modules to ensure a clean start. | ||
const STOP_TIME: Duration = Duration::from_secs(30); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, in case any module failed to be stopped within the 30 seconds, we'll fail edgelet being started?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct. 30 seconds is an arbitrary value, so we can adjust that based on feedback. But if we don't fail iotedged in this case, then there are likely to be downstream problems that are more difficult to diagnose, because the errors observed (errors in this zombie module that can't be stopped) will be farther away from the source of the problem.
_ => Err(err), | ||
}) | ||
.or_else(|err| match Fail::find_root_cause(&err).downcast_ref::<ErrorKind>() { | ||
Some(ErrorKind::NotFound(_)) | Some(ErrorKind::NotModified) => Ok(()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would there be any other error code? Sometimes docker container can hang that the runtime is not able to stop it; or docker itself hangs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After looking through the errors returned from stop, I think these are the two we want to ignore. Other errors signal something more sinister, and I think we want to fail in those cases.
Normally iotedged will stop all modules when it shuts down. But if it crashed, modules will continue to run. On Linux systems where iotedged is responsible for creating/binding the socket (e.g., CentOS 7.5, which uses systemd but does not support systemd socket activation), modules will be left holding stale file descriptors for the workload and management APIs and calls on these APIs will begin to fail. It is expected that modules will be resilient to these sorts of failures, but in reality sometimes they aren't. This change updates iotedged to stop any existing modules when it is starting. The stopped modules will be started again naturally once iotedged (and Edge Agent) are running again.
Normally iotedged will stop all modules when it shuts down. But if it crashed, modules will continue to run. On Linux systems where iotedged is responsible for creating/binding the socket (e.g., CentOS 7.5, which uses systemd but does not support systemd socket activation), modules will be left holding stale file descriptors for the workload and management APIs and calls on these APIs will begin to fail. It is expected that modules will be resilient to these sorts of failures, but in reality sometimes they aren't. This change updates iotedged to stop any existing modules when it is starting. The stopped modules will be started again naturally once iotedged (and Edge Agent) are running again. change tests to unix change tests to unix adding cfg attribute for linux only target adding cfg atrribute for namedtemplate tempdir fmt
Normally iotedged will stop all modules when it shuts down. But if it crashed, modules will continue to run. On Linux systems where iotedged is responsible for creating/binding the socket (e.g., CentOS 7.5, which uses systemd but does not support systemd socket activation), modules will be left holding stale file descriptors for the workload and management APIs and calls on these APIs will begin to fail. It is expected that modules will be resilient to these sorts of failures, but in reality sometimes they aren't.
This change updates iotedged to stop any existing modules when it is starting. The stopped modules will be started again naturally once iotedged (and Edge Agent) are running again.