Network resilience to disconnects #188
We have not experienced this much (it might be more relevant if we had UDP transport) and have no concrete user requests for this -> prioritize lower; still aiming for 1.0.0, but maybe not.
See #612 for an investigation of this loss of liveness.
Why? Could you not just establish a timeout policy on the matter, and abort the Head after consensus has been unreachable long enough?
I think we don't count closing the head as "covering network failures". |
But that seems like the only option to react to a network failure that lasts long enough. Could that matter be covered by another issue then?
Of course this could be covered by another issue; it's just that this issue was not intended to be about this case. I don't think we have a current item for this though.. maybe also because a timeout on non-progress of a head could be detected & handled by the application running on top of Hydra. Network issues are not the only source of non-progress.
Yes, and that possibility is an integral part of consensus.
I am not sure that this would be a good approach for the API. I think such a timeout depends on internal details of consensus, and Hydra may implement different strategies for it depending on some internal information.
The upside of client-side halting is that one could easily change such behavior. But you can achieve the same with an option to disable Hydra-side halting and/or a message from the Hydra server when it thinks consensus has reached the timeout (and thus recommends the client perform the halting).
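To illustrate the client-side variant discussed above: a minimal sketch (in Python, with hypothetical names like `ProgressWatchdog` that are not part of any Hydra API) of an application-side watchdog that merely reports a stall after a configurable period without observed protocol progress, leaving the decision to close the head to the client.

```python
import time

class ProgressWatchdog:
    """Hypothetical client-side watchdog: the application records every
    observed protocol step and, after a configurable period of silence,
    decides for itself whether to act (e.g. close the head). Keeping this
    client-side makes the policy easy to change."""

    def __init__(self, timeout_s, now=time.monotonic):
        self.timeout_s = timeout_s
        self.now = now                 # injectable clock, eases testing
        self.last_progress = now()

    def observed_progress(self):
        """Call whenever the head makes an observable step."""
        self.last_progress = self.now()

    def stalled(self):
        """True once no progress was seen for longer than the timeout."""
        return self.now() - self.last_progress > self.timeout_s


# Usage with a simulated clock:
clock = [0.0]
w = ProgressWatchdog(timeout_s=30, now=lambda: clock[0])
clock[0] = 10.0
assert not w.stalled()      # only 10s of silence
clock[0] = 45.0
assert w.stalled()          # 45s > 30s timeout
w.observed_progress()
assert not w.stalled()      # progress resets the timer
```

Whether the stall signal should originate in the `hydra-node` or the client is exactly the open question in the thread; this sketch only shows how cheap the client-side detection itself is.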
@uhbif19 Bullets 2 and 3 indicate you are thinking about this not only on the "Hydra network" level. Which is fine in itself; the things you mention all have an influence on liveness one way or another. However, within this feature, we want to take one concrete step towards improving the situation by re-submitting Hydra network messages - that is, the L2 protocol for reaching consensus in a Hydra head. This already requires grooming & planning, and some open questions still remain (see "to be discussed" in the original post).
@ch1bo Yes, of course, restricting scope is important, you're right. I just wanted to record my thoughts on this, not necessarily as part of this issue.
We have drawn up a pull-based workflow in a sequence diagram today (@pgrange please provide more context from your write-up):

```mermaid
sequenceDiagram
    Alice->>A: broadcast msg1
    Alice->>Alice: msg1
    Alice->>A: broadcast msg2
    Alice->>Alice: msg2
    Note over B: start B network stack
    B-->>A: connect
    note left of A: after seeing any message
    A->>Alice: PeerConnected
    note over A: concurrently
    A-->>B: connect
    note over A: readIndex B == 1
    A->>B: Send msg1
    A->>B: Send msg2
    B->>Bob: callback msg1
    B->>A: Ack msg1
    A->>A: readIndex B = 2
    Bob->>Bob: protocol logic
    note over B: crashes
    note over A: detects connection down (how?)
    A-->>B: connect
    note over A: readIndex B == 2
    A->>+B: Send msg2
    B->>Bob: callback msg2
    B->>-A: Ack msg2
    A->>A: readIndex B = 3
    Bob->>Bob: protocol logic
```
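The core bookkeeping in the diagram above can be sketched in a few lines. This is a simplified model (in Python, not the actual `hydra-node` implementation): the sender appends every broadcast message to a local log and keeps a per-peer read index; on (re)connect it resends everything from that index onward, and only advances the index when the peer acknowledges a message. Note the sketch uses 0-based indices, while the diagram's `readIndex` is 1-based.

```python
class ResendingBroadcaster:
    """Model of a pull-based resend scheme: a message log plus a
    per-peer read index into it (0-based here)."""

    def __init__(self):
        self.log = []            # all messages ever broadcast, in order
        self.read_index = {}     # peer -> index of next message to send

    def broadcast(self, msg):
        """Append to the log; delivery happens on (re)connect."""
        self.log.append(msg)

    def on_connect(self, peer):
        """Return every message the peer has not yet acknowledged."""
        start = self.read_index.setdefault(peer, 0)
        return self.log[start:]

    def on_ack(self, peer, msg):
        """Advance the read index past an acknowledged message."""
        idx = self.read_index.get(peer, 0)
        if idx < len(self.log) and self.log[idx] == msg:
            self.read_index[peer] = idx + 1


# Replaying the scenario from the diagram: B connects after two
# broadcasts, acks msg1, crashes, reconnects, and receives only msg2.
alice = ResendingBroadcaster()
alice.broadcast("msg1")
alice.broadcast("msg2")
assert alice.on_connect("B") == ["msg1", "msg2"]
alice.on_ack("B", "msg1")                   # readIndex B advances
assert alice.on_connect("B") == ["msg2"]    # after B's crash + reconnect
alice.on_ack("B", "msg2")
assert alice.on_connect("B") == []
```

The open question noted in the diagram ("detects connection down (how?)") is deliberately left out of this sketch: it only shows that, given reconnection, no acknowledged message is resent and no unacknowledged one is lost.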
https://hackmd.io/c/tutorials/%2Fs%2FMathJax-and-UML#UML-Diagrams provides UML diagrams, including sequence diagrams. I propose we use such a document to collaborate on the design of this networking protocol.
Here is a draft PR with a protocol specification proposal to comment on: #1050
IMO this feature should cover this scenario to be of value to our users: #1074 (review) That is:
Removing #1080 as part of this issue |
Why
The Hydra Head becomes stuck very easily, which is bad for user experience. The state machine in the `HeadLogic` does not make progress as it is waiting for some "signal" from peers, or from the chain. This can happen for a wide variety of reasons: when the connection between two `hydra-node`s breaks down, or when one node crashes and restarts. When the Head stalls because of missing responses, it currently needs to be closed & re-opened to continue operation.

This issue specifically wants to address one source of "stalling", namely transient network partitioning between peers.
What
What kind of resilience do we expect:
How
This is a large feature and therefore we want to split it into several deliveries:
Non goals
- `PeerConnected`/`Disconnected` messages we send them
- `Wait` outcomes)?
- `Network` layer without touching the `HeadLogic`. In the case of crash-recovery, the `HeadLogic` will come back at the same state it was before, and the only concern is about "in-flight" network messages that might have been lost.