Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve handling of task panics #4698

Open
0x009922 opened this issue Jun 6, 2024 · 1 comment · May be fixed by #4716
Open

Improve handling of task panics #4698

0x009922 opened this issue Jun 6, 2024 · 1 comment · May be fixed by #4716
Assignees
Labels
Enhancement New feature or request UI Something about the interface

Comments

@0x009922
Copy link
Contributor

0x009922 commented Jun 6, 2024

Currently Iroha spawns many tokio::tasks in a detached way, without checking whether they panic somewhere along execution.

Here are a few examples (just cases when the handle is completely ignored; there are many other when the handle is used, but not to check whether it resolves in an error):

tokio::task::spawn(network.run());

Self::spawn_config_updates_broadcasting(kiso.clone(), logger.clone());

tokio::task::spawn(self.run());

let _handle = iroha_telemetry::ws::start(config.clone(), receiver)

It has a few issues:

  • When some vital task panics, we don't even see that it happened, and don't see an error message.
  • It results in a non-graceful "crumbling" of the system due to some tasks relying on failed ones.

For example, in case of the NetworkBase task panicking, we don't see anything except NetworkBase must accept messages until there is at least one handle to it: SendError { .. } (which doesn't tell the cause).

Instead, it would be good:

  • To see panic messages of most tasks (ideally - of all tasks)
  • When a vital task panics, gracefully shutdown all other tasks
  • Ideally (overkill for now), adopt Supervision Tree design principle: monitor tasks execution in a centralised way with restart/shutdown strategies. This must contribute into overall fault-tolerance of Iroha significantly. For more info, see Erlang/OTP or Supervisor in Elixir

Tools to help: TaskTracker, JoinSet, CancellationToken.

@0x009922 0x009922 added Enhancement New feature or request UI Something about the interface labels Jun 6, 2024
@Erigara
Copy link
Contributor

Erigara commented Jun 6, 2024

There is also a issue that panics SHOULD be detected by panic set_hook which should print panic and shutdown iroha, but there is problems with it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement New feature or request UI Something about the interface
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants