Network resilience to disconnects #188

Closed
2 tasks done
ch1bo opened this issue Jan 30, 2022 · 14 comments
Labels
amber ⚠️ Medium complexity or partly unclear feature L2 Affect off-chain part of the Head protocol/network 💬 feature A feature on our roadmap
Comments


ch1bo commented Jan 30, 2022

Why

The Hydra Head becomes stuck very easily, which is bad for user experience. The state machine in the HeadLogic does not make progress while it is waiting for some "signal" from peers, or from the chain. This can happen for a wide variety of reasons: when the connection between two hydra-nodes breaks down, or when one node crashes and restarts.
When the Head stalls because of missing responses, it currently needs to be closed & re-opened to continue operation.

This issue specifically wants to address one source of "stalling", namely the transient network partitioning between peers.

What

What kind of resilience do we expect:

  1. When a node fails and never restarts, one of the peers must close the head "manually": you can't recover from permanent failures 😛
  2. ✅ If a node is disconnected from the other peers and reconnects (network partition failure mode), the head should still be live
  3. 🔴 When a node fails and then restarts (crash-recovery failure mode), the head should still be live
  4. 🔴 When a node starts to misbehave (e.g. Byzantine failure mode), the head should be closed manually: we don't provide any BFT guarantees, as we assume it's always possible to close the head with the latest known snapshot.

How

This is a large feature and we therefore want to split it into several deliveries:

Non goals

  • Prevent submitting new transactions while not connected to all (configured?) peers
    • This requires the (much harder) notion of "being connected"
    • Is this a problem or not? A head can only progress through snapshots, but while a peer is disconnected a snapshot can be pending for a while and we can accumulate unconfirmed txs
    • It's probably an application/client concern: a client can detect that a snapshot hasn't been produced for a while, or simply ignore the PeerConnected/Disconnected messages we send
  • What about the HeadLogic, will it loop on old message requests? For how long should messages be re-enqueued (this also applies to Wait outcomes)?
    • We realised that we can handle this at the Network layer without touching the HeadLogic. In the case of crash-recovery, the HeadLogic will come back in the same state it was in before, and the only concern is "in-flight" network messages that might have been lost
    • We also realised that the original protocol from the paper was not using broadcast but unicast (or multicast for some messages) and had forms of acknowledgement for both txs and snapshots. Using broadcast simplified the protocol: assuming everyone receives every message, we 1/ don't need tx confirmations and 2/ can compute snapshots locally, but it removed the ability to acknowledge messages to peers
  • Can we use model tests with fuzziness to test this? (chaos monkeys) -> Yes
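To make the re-enqueueing question concrete, here is a minimal sketch (in Python, for illustration only; the hydra-node itself is Haskell and none of these names match its actual code) of one possible policy: an event that produces a "wait" outcome goes back to the end of the queue with a decremented TTL, so the state machine cannot loop forever on stale requests.

```python
# Hypothetical sketch: re-enqueue events that produced a "wait" outcome,
# with a TTL so the logic cannot loop forever on stale requests.
# All names are illustrative, not the actual hydra-node API.
from collections import deque

def step(logic, state, queue):
    """Process the head of the queue. On "wait", re-enqueue the event at
    the back with a decremented TTL; drop it once the TTL is exhausted.
    Otherwise, `logic` returns the new state."""
    if not queue:
        return state, queue
    payload, ttl = queue.popleft()
    outcome = logic(state, payload)
    if outcome == "wait":
        if ttl > 0:
            queue.append((payload, ttl - 1))  # retry later
        # else: TTL exhausted, give up waiting on this event
        return state, queue
    return outcome, queue
```

With a logic that always waits, an event with TTL 1 is retried exactly once and then dropped, bounding how long old requests linger.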
@ch1bo ch1bo added 💬 feature A feature on our roadmap network amber ⚠️ Medium complexity or partly unclear feature and removed network labels Jan 30, 2022
@ch1bo ch1bo changed the title Re-establish off-chain connections Resubmit network messages Mar 22, 2022

ch1bo commented May 31, 2022

We have not experienced this much (it might be more relevant if we had a UDP transport) and have no concrete user requests for this -> prioritize lower, still aiming for 1.0.0 but maybe not.

@ch1bo ch1bo changed the title Resubmit network messages Hydra network failure resilience May 31, 2022
@ch1bo ch1bo added this to the Mainnet milestone Feb 9, 2023
@pgrange pgrange removed this from the Mainnet milestone Feb 21, 2023

ch1bo commented Mar 13, 2023

See #612 for an investigation of this loss of liveness.


uhbif19 commented May 26, 2023

Longer downtimes (depending on the contestation period Hydra protocol parameter) are not covered!

Why?

Couldn't you just establish a timeout policy and abort the Head after consensus has been unreachable for long enough?


ch1bo commented May 26, 2023

Couldn't you just establish a timeout policy and abort the Head after consensus has been unreachable for long enough

I think we don't count closing the head as "covering network failures".


uhbif19 commented May 28, 2023

@ch1bo

I think we don't count closing the head as "covering network failures".

But that seems like the only option for reacting to a network failure that lasts long enough.

Could that be covered by another issue then?


ch1bo commented May 28, 2023

Of course this could be covered by another issue; it's just that this issue was not intended to be about that case. I don't think we have a current item for this though... maybe also because a timeout on non-progress of a head could be detected & handled by the application running on top of Hydra. Network issues are not the only source of non-progress.


uhbif19 commented Jun 8, 2023

@ch1bo

Network issues are not the only source of non-progress

Yes, and that possibility is an integral part of consensus.

maybe also because that time out of non-progress of a head could be detected & handled by the application running on top of Hydra.

I am not sure this would be a good approach for the API. I think such a timeout depends on internal details of the consensus, and Hydra may implement different strategies for it based on internal information.

  1. That requires the Hydra client to have some state & timers, while it could otherwise be completely stateless. Also, in that case you should state it in the docs, so clients know they need to handle such cases.
  2. The timeout before a client calls for halting will depend on tx times on L1 and the finality criteria used by Hydra, and also on how many txs are performed on L1, which may change with Hydra releases. So a client may need to adjust such timeouts at least for different networks and Hydra releases.
  3. You handle L1 rollbacks. I think that affects timeouts as well, while the client does not know whether rollbacks happened.
  4. Hydra consensus may change in the future, changing this behavior as well.

The upside of client-side halting is that one could easily change such behavior. But you can achieve the same with an option to disable Hydra-side halting and/or a message from the Hydra server when it thinks consensus has timed out (and thus recommends the client perform halting).


ch1bo commented Jun 12, 2023

@uhbif19 Bullets 2 and 3 indicate you are thinking about this not only on the "Hydra network" level. Which is fine in itself; the things you mention all have an influence on liveness one way or another.

However, within this feature, we want to take one concrete step towards improving the situation by re-submitting Hydra network messages, that is, the L2 protocol for reaching consensus in a Hydra head. This already requires grooming & planning, and some open questions still remain (see "to be discussed" in the original post).


uhbif19 commented Jun 13, 2023

@ch1bo Yes, of course, restricting scope is important, you're right. I just wanted to record my thoughts on this, not necessarily as part of this issue.


ch1bo commented Aug 30, 2023

We have drawn up a pull-based workflow in a sequence diagram today (@pgrange please provide more context from your write-up)

```mermaid
sequenceDiagram
    Alice->>A: broadcast msg1
    Alice->>Alice: msg1
    Alice->>A: broadcast msg2
    Alice->>Alice: msg2

    Note over B: start B network stack
    B-->>A: connect
    note left of A: after seeing any message
    A->>Alice: PeerConnected

    note over A: concurrently
    A-->>B: connect
    note over A: readIndex B == 1
    A->>B: Send msg1
    A->>B: Send msg2

    B->>Bob: callback msg1
    B->>A: Ack msg1
    A->>A: readIndex B = 2
    Bob->>Bob: protocol logic
    note over B: crashes

    note over A: detects connection down (how?)

    A-->>B: connect
    note over A: readIndex B == 2
    A->>+B: Send msg2
    B->>Bob: callback msg2
    B->>-A: Ack msg2
    A->>A: readIndex B = 3
    Bob->>Bob: protocol logic
```
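The diagram's resend-on-reconnect behaviour can be sketched in a few lines (a Python sketch for illustration; `Outbound`, `sent_log`, `read_index` and the method names are hypothetical and not the actual hydra-node API): keep every broadcast message in an append-only log, track a per-peer read index, and on (re)connect resend everything the peer has not yet acknowledged.

```python
# Hypothetical sketch of the pull-based resend idea from the diagram:
# an append-only log of broadcast messages plus a per-peer read index.
# All names are illustrative, not the actual hydra-node implementation.
class Outbound:
    def __init__(self):
        self.sent_log = []    # every message ever broadcast, in order
        self.read_index = {}  # peer -> index of next message to send

    def broadcast(self, msg):
        # Broadcasting only appends; peers are caught up lazily.
        self.sent_log.append(msg)

    def on_connect(self, peer):
        """Messages to (re)send when a peer (re)connects."""
        return self.sent_log[self.read_index.get(peer, 0):]

    def on_ack(self, peer):
        """An Ack advances the peer's read index by one message."""
        self.read_index[peer] = self.read_index.get(peer, 0) + 1
```

A peer that crashes after acknowledging msg1 would, on reconnect, be sent only msg2, matching the second half of the diagram.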

@abailly-iohk

https://hackmd.io/c/tutorials/%2Fs%2FMathJax-and-UML#UML-Diagrams provides UML diagrams, including sequence diagrams. I propose we use such a document to collaborate on the design of this networking protocol.


pgrange commented Aug 31, 2023

Here is a draft PR with a protocol specification proposal to comment on: #1050

@ch1bo ch1bo added the L2 Affect off-chain part of the Head protocol/network label Sep 19, 2023
@ch1bo ch1bo changed the title Hydra network failure resilience Hydra network failure resilience to disconnects Sep 19, 2023
@ch1bo ch1bo changed the title Hydra network failure resilience to disconnects Network resilience to disconnects Sep 20, 2023
@ch1bo ch1bo added this to the 0.13.0 milestone Sep 20, 2023

ch1bo commented Sep 27, 2023

IMO this feature should cover this scenario to be of value to our users: #1074 (review). That is:

  • Start the demo devnet with 3 participants and open a head
  • Submit a transaction via carol -> See SnapshotConfirmed #1
  • Stop node of alice
  • Submit another transaction via carol -> See only TxValid
  • Start node of alice -> expect it to catch up and result in SnapshotConfirmed #2
  • Submit yet another transaction via carol -> Should see SnapshotConfirmed #3

@abailly-iohk abailly-iohk modified the milestones: 0.13.0, 0.14.0 Oct 3, 2023
@abailly-iohk

Removing #1080 as part of this issue
