Network resilience to disconnects #188

Closed
2 tasks done
ch1bo opened this issue Jan 30, 2022 · 14 comments
Labels
amber ⚠️ Medium complexity or partly unclear feature L2 Affect off-chain part of the Head protocol/network 💬 feature A feature on our roadmap
Comments


ch1bo commented Jan 30, 2022

Why

The Hydra Head becomes stuck very easily, which is bad for user experience. The state machine in the HeadLogic does not make progress while it is waiting for some "signal" from peers, or from the chain. This can happen for a wide variety of reasons: when the connection between two hydra-nodes breaks down, or when one node crashes and restarts.
When the Head stalls because of missing responses, it currently needs to be closed & re-opened to continue operation.

This issue specifically wants to address one source of "stalling", namely the transient network partitioning between peers.

What

What kind of resilience do we expect:

  1. When a node fails and never restarts, one of the peers must close the head "manually": you can't recover from permanent failures 😛
  2. ✅ If a node is disconnected from the other peers and reconnects (network partition failure mode), the head should still be live
  3. 🔴 When a node fails and then restarts (crash-recovery failure mode), the head should still be live
  4. 🔴 When a node starts to misbehave (e.g. Byzantine failure mode), the head should be closed manually: we don't provide any BFT guarantees, as we assume it's always possible to close the head with the latest known snapshot.

How

This is a large feature and we therefore want to split it into several deliveries:

Non goals

  • Prevent submitting new transactions while not connected to all (configured?) peers
    • This requires the (much harder) notion of "being connected"
    • Is this a problem or not? A head can only progress through snapshots, but while a peer is disconnected a snapshot can be pending for a while and we can accumulate unconfirmed txs
    • It's probably an application/client concern: a client can detect that a snapshot hasn't been produced for a while, or simply ignore the PeerConnected/Disconnected messages we send
  • What about the HeadLogic, will it loop on old message requests? For how long should messages be re-enqueued (this also applies to Wait outcomes)?
    • We realised that we can handle this at the Network layer without touching the HeadLogic. In the case of crash-recovery, the HeadLogic will come back in the same state it was in before, and the only concern is "in-flight" network messages that might have been lost
    • We also realised that the original protocol from the paper was not using broadcast but unicast (or multicast for some messages) and had forms of acknowledgement for both txs and snapshots. Using broadcast simplified the protocol: assuming everyone receives every message, we 1/ don't need tx confirmations and 2/ can compute snapshots locally, but it removed the ability to acknowledge messages to peers
  • Can we use model tests with fuzziness to test this? (chaos monkeys) -> Yes
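To make the re-enqueueing question concrete, here is a minimal sketch (in Python, for illustration only; the hydra-node itself is Haskell and none of these names match its actual code) of one possible policy: an event that produces a "wait" outcome goes back to the end of the queue with a decremented TTL, so the state machine cannot loop forever on stale requests.

```python
# Hypothetical sketch: re-enqueue events that produced a "wait" outcome,
# with a TTL so the logic cannot loop forever on stale requests.
# All names are illustrative, not the actual hydra-node API.
from collections import deque

def step(logic, state, queue):
    """Process the head of the queue. On "wait", re-enqueue the event at
    the back with a decremented TTL; drop it once the TTL is exhausted.
    Otherwise, `logic` returns the new state."""
    if not queue:
        return state, queue
    payload, ttl = queue.popleft()
    outcome = logic(state, payload)
    if outcome == "wait":
        if ttl > 0:
            queue.append((payload, ttl - 1))  # retry later
        # else: TTL exhausted, give up waiting on this event
        return state, queue
    return outcome, queue
```

With a logic that always waits, an event with TTL 1 is retried exactly once and then dropped, bounding how long old requests linger.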
@ch1bo ch1bo added 💬 feature A feature on our roadmap network amber ⚠️ Medium complexity or partly unclear feature and removed network labels Jan 30, 2022
@ch1bo ch1bo changed the title Re-establish off-chain connections Resubmit network messages Mar 22, 2022

ch1bo commented May 31, 2022

We have not experienced this much (it might be more relevant if we had a UDP transport) and have no concrete user requests for this -> prioritize lower, still aiming for 1.0.0 but maybe not.

@ch1bo ch1bo changed the title Resubmit network messages Hydra network failure resilience May 31, 2022
@ch1bo ch1bo added this to the Mainnet milestone Feb 9, 2023
@pgrange pgrange removed this from the Mainnet milestone Feb 21, 2023

ch1bo commented Mar 13, 2023

See #612 for an investigation of this loss of liveness.


uhbif19 commented May 26, 2023

Longer downtimes (depending on the contestation period Hydra protocol parameter) are not covered!

Why?

Couldn't you just establish a timeout policy and abort the Head after consensus has been unreachable for long enough?


ch1bo commented May 26, 2023

Couldn't you just establish a timeout policy and abort the Head after consensus has been unreachable for long enough

I think we don't count closing the head as "covering network failures".


uhbif19 commented May 28, 2023

@ch1bo

I think we don't count closing the head as "covering network failures".

But that seems like the only option for reacting to a network failure that lasts long enough.

Could that be covered by another issue then?


ch1bo commented May 28, 2023

Of course this could be covered by another issue; it's just that this issue was not intended to be about that case. I don't think we have a current item for this though... maybe also because a timeout on non-progress of a head could be detected & handled by the application running on top of Hydra. Network issues are not the only source of non-progress.


uhbif19 commented Jun 8, 2023

@ch1bo

Network issues are not the only source of non-progress

Yes, and that possibility is an integral part of consensus.

maybe also because that time out of non-progress of a head could be detected & handled by the application running on top of Hydra.

I am not sure this would be a good approach for the API. I think such a timeout depends on internal details of the consensus, and Hydra may implement different strategies for it based on internal information.

  1. That requires the Hydra client to have some state & timers, while it could otherwise be completely stateless. Also, in that case you should state it in the docs, so clients know they need to handle such cases.
  2. The timeout before a client calls for halting will depend on tx times on L1 and the finality criteria used by Hydra, and also on how many txs are performed on L1, which may change with Hydra releases. So a client may need to adjust such timeouts at least for different networks and Hydra releases.
  3. You handle L1 rollbacks. I think that affects timeouts as well, while the client does not know whether rollbacks happened.
  4. Hydra consensus may change in the future, changing this behavior as well.

The upside of client-side halting is that one could easily change such behavior. But you can achieve the same with an option to disable Hydra-side halting and/or a message from the Hydra server when it thinks consensus has timed out (and thus recommends the client perform halting).


ch1bo commented Jun 12, 2023

@uhbif19 Bullets 2 and 3 indicate you are thinking about this not only on the "Hydra network" level. Which is fine in itself; the things you mention all have an influence on liveness one way or another.

However, within this feature, we want to take one concrete step towards improving the situation by re-submitting Hydra network messages, that is, the L2 protocol for reaching consensus in a Hydra head. This already requires grooming & planning, and some open questions still remain (see "to be discussed" in the original post).


uhbif19 commented Jun 13, 2023

@ch1bo Yes, of course, restricting scope is important, you're right. I just wanted to record my thoughts on this, not necessarily as part of this issue.


ch1bo commented Aug 30, 2023

We have drawn up a pull-based workflow in a sequence diagram today (@pgrange please provide more context from your write-up)

```mermaid
sequenceDiagram
    Alice->>A: broadcast msg1
    Alice->>Alice: msg1
    Alice->>A: broadcast msg2
    Alice->>Alice: msg2

    Note over B: start B network stack
    B-->>A: connect
    note left of A: after seeing any message
    A->>Alice: PeerConnected

    note over A: concurrently
    A-->>B: connect
    note over A: readIndex B == 1
    A->>B: Send msg1
    A->>B: Send msg2

    B->>Bob: callback msg1
    B->>A: Ack msg1
    A->>A: readIndex B = 2
    Bob->>Bob: protocol logic
    note over B: crashes

    note over A: detects connection down (how?)

    A-->>B: connect
    note over A: readIndex B == 2
    A->>+B: Send msg2
    B->>Bob: callback msg2
    B->>-A: Ack msg2
    A->>A: readIndex B = 3
    Bob->>Bob: protocol logic
```
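The diagram's resend-on-reconnect behaviour can be sketched in a few lines (a Python sketch for illustration; `Outbound`, `sent_log`, `read_index` and the method names are hypothetical and not the actual hydra-node API): keep every broadcast message in an append-only log, track a per-peer read index, and on (re)connect resend everything the peer has not yet acknowledged.

```python
# Hypothetical sketch of the pull-based resend idea from the diagram:
# an append-only log of broadcast messages plus a per-peer read index.
# All names are illustrative, not the actual hydra-node implementation.
class Outbound:
    def __init__(self):
        self.sent_log = []    # every message ever broadcast, in order
        self.read_index = {}  # peer -> index of next message to send

    def broadcast(self, msg):
        # Broadcasting only appends; peers are caught up lazily.
        self.sent_log.append(msg)

    def on_connect(self, peer):
        """Messages to (re)send when a peer (re)connects."""
        return self.sent_log[self.read_index.get(peer, 0):]

    def on_ack(self, peer):
        """An Ack advances the peer's read index by one message."""
        self.read_index[peer] = self.read_index.get(peer, 0) + 1
```

A peer that crashes after acknowledging msg1 would, on reconnect, be sent only msg2, matching the second half of the diagram.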

@abailly-iohk

https://hackmd.io/c/tutorials/%2Fs%2FMathJax-and-UML#UML-Diagrams provides UML diagrams, including sequence diagrams. I propose we use such a document to collaborate on the design of this networking protocol.


pgrange commented Aug 31, 2023

Here is a draft PR with a protocol specification proposal to comment on: #1050

@ch1bo ch1bo added the L2 Affect off-chain part of the Head protocol/network label Sep 19, 2023
@ch1bo ch1bo changed the title Hydra network failure resilience Hydra network failure resilience to disconnects Sep 19, 2023
@ch1bo ch1bo changed the title Hydra network failure resilience to disconnects Network resilience to disconnects Sep 20, 2023
@ch1bo ch1bo added this to the 0.13.0 milestone Sep 20, 2023

ch1bo commented Sep 27, 2023

IMO this feature should cover this scenario to be of value to our users: #1074 (review). That is:

  • Start the demo devnet with 3 participants and open a head
  • Submit a transaction via carol -> See SnapshotConfirmed #1
  • Stop node of alice
  • Submit another transaction via carol -> See only TxValid
  • Start node of alice -> expect it to catch up and result in SnapshotConfirmed #2
  • Submit yet another transaction via carol -> Should see SnapshotConfirmed #3

@abailly-iohk abailly-iohk modified the milestones: 0.13.0, 0.14.0 Oct 3, 2023
@abailly-iohk

Removing #1080 as part of this issue
