Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stop existing modules when iotedged starts #3299

Merged
merged 6 commits into from
Jul 29, 2020

Conversation

damonbarry
Copy link
Member

@damonbarry damonbarry commented Jul 27, 2020

Normally iotedged will stop all modules when it shuts down. But if it crashed, modules will continue to run. On Linux systems where iotedged is responsible for creating/binding the socket (e.g., CentOS 7.5, which uses systemd but does not support systemd socket activation), modules will be left holding stale file descriptors for the workload and management APIs and calls on these APIs will begin to fail. It is expected that modules will be resilient to these sorts of failures, but in reality sometimes they aren't.

This change updates iotedged to stop any existing modules when it is starting. The stopped modules will be started again naturally once iotedged (and Edge Agent) are running again.

@damonbarry damonbarry marked this pull request as ready for review July 28, 2020 23:37
@damonbarry damonbarry requested a review from arsing July 28, 2020 23:37
@arsing
Copy link
Member

arsing commented Jul 28, 2020

On some platforms (e.g. CentOS 7.5)

It would be worth making it clear (in both the commit message and the code comment) that this situation happens specifically when the socket is created by iotedged rather than using systemd socket activation. With systemd socket activation the socket file is reused when iotedged restarts, but without socket activation iotedged has to bind it itself, which means it has to unlink it first, which means it gets disconnected from the modules.

@damonbarry
Copy link
Member Author

It would be worth making it clear (in both the commit message and the code comment) that this situation happens specifically when...

Updated, thanks!

Copy link
Contributor

@CindyXing CindyXing left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am curious about some corner cases. Thanks

// for the workload and management APIs. On some platforms (e.g. CentOS 7.5), calls
// to these APIs will begin to fail. Resilient modules should be able to deal with
// this, but we'll restart all modules to ensure a clean start.
const STOP_TIME: Duration = Duration::from_secs(30);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, in case any module failed to be stopped within the 30 seconds, we'll fail edgelet being started?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. 30 seconds is an arbitrary value, so we can adjust that based on feedback. But if we don't fail iotedged in this case, then there are likely to be downstream problems that are more difficult to diagnose, because the errors observed (errors in this zombie module that can't be stopped) will be farther away from the source of the problem.

_ => Err(err),
})
.or_else(|err| match Fail::find_root_cause(&err).downcast_ref::<ErrorKind>() {
Some(ErrorKind::NotFound(_)) | Some(ErrorKind::NotModified) => Ok(()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would there be any other error code? Sometimes docker container can hang that the runtime is not able to stop it; or docker itself hangs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After looking through the errors returned from stop, I think these are the two we want to ignore. Other errors signal something more sinister, and I think we want to fail in those cases.

@kodiakhq kodiakhq bot merged commit 2d1a609 into Azure:master Jul 29, 2020
@damonbarry damonbarry deleted the edgelet-restart-modules branch July 29, 2020 16:08
damonbarry added a commit to damonbarry/iotedge that referenced this pull request Aug 7, 2020
Normally iotedged will stop all modules when it shuts down. But if it crashed, modules will continue to run. On Linux systems where iotedged is responsible for creating/binding the socket (e.g., CentOS 7.5, which uses systemd but does not support systemd socket activation), modules will be left holding stale file descriptors for the workload and management APIs and calls on these APIs will begin to fail. It is expected that modules will be resilient to these sorts of failures, but in reality sometimes they aren't.

This change updates iotedged to stop any existing modules when it is starting. The stopped modules will be started again naturally once iotedged (and Edge Agent) are running again.
ggjjj pushed a commit to ggjjj/iotedge that referenced this pull request Aug 20, 2020
Normally iotedged will stop all modules when it shuts down. But if it crashed, modules will continue to run. On Linux systems where iotedged is responsible for creating/binding the socket (e.g., CentOS 7.5, which uses systemd but does not support systemd socket activation), modules will be left holding stale file descriptors for the workload and management APIs and calls on these APIs will begin to fail. It is expected that modules will be resilient to these sorts of failures, but in reality sometimes they aren't.

This change updates iotedged to stop any existing modules when it is starting. The stopped modules will be started again naturally once iotedged (and Edge Agent) are running again.

change tests to unix

change tests to unix

adding cfg attribute for linux only target

adding cfg atrribute for namedtemplate

tempdir

fmt
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants