Stop existing modules when iotedged starts #3299

damonbarry · 2020-07-27T23:00:06Z

Normally iotedged stops all modules when it shuts down. But if it crashes, the modules continue to run. On Linux systems where iotedged is responsible for creating and binding the socket (e.g., CentOS 7.5, which uses systemd but does not support systemd socket activation), the modules are left holding stale file descriptors for the workload and management APIs, and calls to these APIs begin to fail. Modules are expected to be resilient to these sorts of failures, but in reality sometimes they aren't.

This change updates iotedged to stop any existing modules when it starts. The stopped modules will be started again naturally once iotedged and Edge Agent are running again.

arsing · 2020-07-28T23:44:26Z

On some platforms (e.g. CentOS 7.5)

It would be worth making it clear (in both the commit message and the code comment) that this situation happens specifically when the socket is created by iotedged rather than through systemd socket activation. With systemd socket activation, the socket file is reused when iotedged restarts. Without it, iotedged has to bind the socket itself, which means unlinking the existing socket file first, which in turn disconnects it from the modules.
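To illustrate the unlink-then-bind step described above, here is a minimal sketch using only the Rust standard library. It is not the iotedged code; `bind_fresh` is a hypothetical helper name. It shows why a process that binds its own Unix socket has to remove the stale socket file first: binding to a path that already exists fails, and the freshly bound socket is a new file that peers holding the old one are no longer connected to.

```rust
use std::fs;
use std::io;
use std::os::unix::net::UnixListener;
use std::path::Path;

/// Hypothetical helper: bind a Unix domain socket at `path`,
/// unlinking any stale socket file left behind by a previous process.
fn bind_fresh(path: &Path) -> io::Result<UnixListener> {
    match fs::remove_file(path) {
        // A stale socket file existed and has been unlinked.
        Ok(()) => {}
        // Nothing to clean up: this is the first start.
        Err(err) if err.kind() == io::ErrorKind::NotFound => {}
        // Any other failure (e.g. permissions) is a real error.
        Err(err) => return Err(err),
    }
    // The bind creates a brand-new socket file at the same path.
    UnixListener::bind(path)
}
```

With socket activation, systemd owns the socket and hands the same listening descriptor to the restarted service, so this unlink step does not happen.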

damonbarry · 2020-07-29T00:06:25Z

It would be worth making it clear (in both the commit message and the code comment) that this situation happens specifically when...

Updated, thanks!

CindyXing

I am curious about some corner cases. Thanks

CindyXing · 2020-07-29T01:09:40Z

edgelet/iotedged/src/lib.rs

+                // for the workload and management APIs. On some platforms (e.g. CentOS 7.5), calls
+                // to these APIs will begin to fail. Resilient modules should be able to deal with
+                // this, but we'll restart all modules to ensure a clean start.
+                const STOP_TIME: Duration = Duration::from_secs(30);


So if any module fails to stop within the 30 seconds, we'll fail to start edgelet?

damonbarry · 2020-07-29

Correct. 30 seconds is an arbitrary value, so we can adjust it based on feedback. But if we don't fail iotedged in this case, there are likely to be downstream problems that are more difficult to diagnose, because the errors observed (errors in this zombie module that can't be stopped) will be farther away from the source of the problem.
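The fail-fast behavior discussed here can be sketched generically. The following is an illustration under stated assumptions, not how iotedged implements it: `stop_with_deadline` and `StartupError` are hypothetical names, and the sketch enforces the deadline with a worker thread and a channel, whereas the PR passes `STOP_TIME` to the runtime's own stop call. The point is only that a stop step that fails or overruns its deadline becomes a startup error close to its cause.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical error type for a startup sequence.
#[derive(Debug, PartialEq)]
enum StartupError {
    StopTimedOut,
    StopFailed(String),
}

/// Hypothetical helper: run `stop_all` on a worker thread and wait at most
/// `stop_time` for it. Any failure or timeout aborts startup.
fn stop_with_deadline<F>(stop_all: F, stop_time: Duration) -> Result<(), StartupError>
where
    F: FnOnce() -> Result<(), String> + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already be gone if the deadline passed.
        let _ = tx.send(stop_all());
    });
    match rx.recv_timeout(stop_time) {
        Ok(Ok(())) => Ok(()),
        Ok(Err(msg)) => Err(StartupError::StopFailed(msg)),
        Err(mpsc::RecvTimeoutError::Timeout) => Err(StartupError::StopTimedOut),
        Err(mpsc::RecvTimeoutError::Disconnected) => Err(StartupError::StopFailed(
            "stop worker exited without a result".to_string(),
        )),
    }
}
```

A caller would return the error from its startup path rather than continue, for example `stop_with_deadline(stop_modules, Duration::from_secs(30))?`, where `stop_modules` is whatever closure performs the stop.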

CindyXing · 2020-07-29T01:11:05Z

edgelet/edgelet-docker/src/runtime.rs

-                    _ => Err(err),
-                })
+                .or_else(|err| match Fail::find_root_cause(&err).downcast_ref::<ErrorKind>() {
+                        Some(ErrorKind::NotFound(_)) | Some(ErrorKind::NotModified) => Ok(()),


Could there be any other error codes? Sometimes a Docker container hangs so that the runtime is not able to stop it, or Docker itself hangs.

damonbarry · 2020-07-29

After looking through the errors returned from stop, I think these are the two we want to ignore. Other errors signal something more sinister, and I think we want to fail in those cases.
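The "ignore these two, fail on the rest" rule can be shown in isolation. This is a simplified sketch, not the code in `edgelet-docker`: the `ErrorKind` enum below is a local stand-in that only borrows the variant names `NotFound` and `NotModified` from the diff, the `Other` variant and the function name `ignore_already_stopped` are assumptions, and the root-cause lookup from the diff is omitted.

```rust
/// Local stand-in for the runtime's error kinds (assumption: only the
/// `NotFound` and `NotModified` names come from the diff above).
#[derive(Debug, PartialEq)]
enum ErrorKind {
    NotFound(String),
    NotModified,
    Other(String),
}

/// Treat "container does not exist" and "container already stopped" as
/// success when stopping a module; propagate every other error.
fn ignore_already_stopped(result: Result<(), ErrorKind>) -> Result<(), ErrorKind> {
    result.or_else(|err| match err {
        ErrorKind::NotFound(_) | ErrorKind::NotModified => Ok(()),
        other => Err(other),
    })
}
```

Keeping the allow-list this narrow means a hung container or an unresponsive Docker daemon still surfaces as an error instead of being silently swallowed.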


damonbarry added 4 commits July 27, 2020 15:45

Stop existing modules when iotedged starts

5ce6f4c

Don't fail stop_all if module is already stopped

50532b1

Take 2: Don't fail stop_all if module is already stopped

b31cfb2

cargo fmt

5767264

damonbarry marked this pull request as ready for review July 28, 2020 23:37

damonbarry requested a review from arsing July 28, 2020 23:37

Update comment to clarify fix's impact

0e78b45

arsing approved these changes Jul 29, 2020

damonbarry added the ready-to-merge label Jul 29, 2020

Merge branch 'master' into edgelet-restart-modules

efa0a63

CindyXing reviewed Jul 29, 2020

damonbarry added ready-to-merge and removed ready-to-merge labels Jul 29, 2020

kodiakhq bot merged commit 2d1a609 into Azure:master Jul 29, 2020

damonbarry deleted the edgelet-restart-modules branch July 29, 2020 16:08
