Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ICS3 handshake version negotiation problem #459

Closed
adizere opened this issue Jul 30, 2020 · 15 comments · Fixed by #479
Closed

ICS3 handshake version negotiation problem #459

adizere opened this issue Jul 30, 2020 · 15 comments · Fixed by #479
Assignees

Comments

@adizere
Copy link
Contributor

adizere commented Jul 30, 2020

Overview

It may be possible that the connection handshake protocol aborts at a chain during step connOpenTry if this chain previously executed step connOpenInit.

Execution example

Imagine the following scenario for establishing a connection between chains Alice and Bob, consisting of 4 steps:

  1. chain Alice executes connOpenInit

    • Alice stores a ConnectionEnd object where the field version <- X (X is a list of versions that Alice supports, obtained from getCompatibleVersions())
    • the other fields are not relevant
  2. chain Bob executes connOpenInit

    • Bob stores a ConnectionEnd object where the field version <- Y (Y is the list of versions that Bob supports, obtained from getCompatibleVersions() on Bob's side)
    • suppose this is for the same connection that Alice executed at step 1 above
  3. a correct relayer constructs a connOpenTry datagram to further progress in the connection handshake

    • assume that the destination of this datagram is Bob
    • the field counterpartyVersions in this datagram consists of list X (again, this is the list of versions that Alice supports)
  4. chain Bob delivers and executes connOpenTry

    • recall that Bob already stores a ConnectionEnd with version === Y (step 2 above)
    • Bob will abort execution of connOpenTry handler because the following check (marked with ~~>) fails:
abortTransactionUnless(
      (previous === null) ||
      (previous.state === INIT &&
        previous.counterpartyConnectionIdentifier === counterpartyConnectionIdentifier &&
        previous.counterpartyPrefix === counterpartyPrefix &&
        previous.clientIdentifier === clientIdentifier &&
        previous.counterpartyClientIdentifier === counterpartyClientIdentifier &&
   ~~>  previous.version === version))

Further clarifications:

Bob aborts execution of connOpenTry only if X =/= Y, i.e., only if the two lists of versions X and Y returned from getCompatibleVersions() are different at the two chains. If we assume that the two chains always return the same list of versions, then this handler should not abort, so there should not be a problem.

Alice would also abort execution if she executes the connOpenTry handler by the same reasoning as above: Alice stores a ConnectionEnd object in the variable previous with version === X and she is comparing this against Y.

Fix

It seems that removing the verification previous.version === version would fix this problem.

@cwgoes
Copy link
Contributor

cwgoes commented Jul 30, 2020

I see. Nice catch. I think this check is problematic anyways, since pickVersion(counterpartyVersions) is supposed to pick a single version, as opposed to the multiple versions which are initially stored in connOpenInit(). Probably this was originally written because all other fields (besides version) are supposed to be immutable once connOpenInit() is called - this is what allows e.g. optimistic packet sends to happen safely.

We can remove the previous.version === version check without compromising the safety of the connection handshake, that should be fine, but we need to think about how optimistic connection sends (sending a packet before a connection handshake has completed) will work when version is no longer fixed by connOpenInit() (although it really wasn't ever supposed to be anyways) - either:

  • We give up that feature and fix version only at the end of the connection handshake (optimistic channel sends are still OK)
  • We implement an alternative handshake path which fixes the version at connOpenInit and allows optimistic connection sends

@adizere
Copy link
Contributor Author

adizere commented Aug 4, 2020

After further thoughts, the problem here seems more subtle..

In the current specification, due to the previous.version === version verification (in __ connOpenTry__), the handshake protocol has a liveness problem: If both chains execute connOpenInit, then the protocol will never terminate (as described above in this issue). Without this verification, however, a safety problem appears. We can think of this problem as a "split-brain," where the two chains both finish the connection handshake protocol and obtain a OPEN ConnectionEnd each, but the ends of the connection are inconsistent in terms of their version field. The execution below sketches this problem.

  1. chain Alice executes connOpenInit

    • Alice stores a ConnectionEnd object where the field version <- [x] (a list with one element x)
    • The state of the connection is INIT at Alice
  2. chain Bob executes connOpenInit

    • Bob stores a ConnectionEnd object where the field version <- [y] (a list with one element y)
    • The state of the connection is INIT at Bob
  3. suppose two correct relayers exist:

    • relayer 1 constructs a connOpenTry datagram to further progress in the connection, and the destination of this datagram is Bob; the field counterpartyVersions in this datagram consists of list [x]
    • relayer 2 constructs a connOpenTry datagram, and the destination of this datagram is Alice; correspondingly, the field counterpartyVersions in this datagram consists of list [y]
  4. chain Bob executes connOpenTry

    • The connection goes into state TRYOPEN at Bob
    • The version in this connection is x (the result of version = pickVersion([x]))
  5. chain Alice executes connOpenTry

    • The connection goes into state TRYOPEN at Alice
    • The version in this connection is y (the result of version = pickVersion([y]))
  6. back at the two correct relayers:

    • relayer 1 constructs a connOpenAck datagram, with destination Bob; the field version in this datagram is y (the state of chain Alice)
    • relayer 2 constructs a connOpenAck datagram for destination Alice; the field version is x
  7. Bob executes connOpenAck

    • All verifications pass and the connection goes into state OPEN at Bob
    • The version of this connection, according to Bob is y
  8. Alice executes connOpenAck

    • All verifications pass and the connection goes into state OPEN at Alice
    • The version of this connection, according to Alice is x

Note that connOpenConfirm never needs to execute at any party.

The basic problem is the symmetry of this protocol. It would be great if your suggestion above would fix it, but I'm not sure:

either:

  • We give up that feature and fix version only at the end of the connection handshake (optimistic channel sends are still OK)
  • We implement an alternative handshake path which fixes the version at connOpenInit and allows optimistic connection sends

A principled fix would be to break this symmetry, preferably as early as possible in the handshake steps. For instance only one of the parties would be allowed to execute connOpenInit_ (may be based on simple lexicographical ordering). If only Alice can run connOpenInit, then no problem should ever arise. But I suspect this is not an option (?). A simpler fix (a hack) is to negotiate versions using an append-only set and then order those versions deterministically at each chain. If this sounds interesting, we can flesh out the exact details, but I would rather be in favor of the former solution.

@cwgoes
Copy link
Contributor

cwgoes commented Aug 5, 2020

The basic problem is the symmetry of this protocol.

Yes, I agree. Per a separate discussion with Agoric we plan to revise the handshake so that the initiator can choose whether or not to fix the version and remoteIdentifier fields at the start of the handshake - #462 - this should allow for both optimistic handshakes & dynamic negotiation, just not necessarily at the same time.

This doesn't address the split-brain problem you describe above, however.

The only information we could rely on for some sort of asymmetry (e.g. lexicographical ordering) is the client state and/or identifiers, since the clients must already exist. Something along this line might be possible.

Alternatively, we could add a fifth handshake datagram, which would allow confirming the version before marking the connection OPEN.

(thinking some more...)

@cwgoes
Copy link
Contributor

cwgoes commented Aug 10, 2020

Ruminating on this ... the issue with adding an extra step to the handshake is that it will just result in a hung handshake (no progress) if versions are disagreed upon, since there is no way to reconcile them...

@ancazamfir
Copy link
Contributor

if I understand correctly, the problem occurs when the local and counterparty versions intersection has more than one version. And, assuming a pickVersion() implementation by preference, the preference for this intersection is different on A and B.
For example, if A's supported versions are [y, x, w] while B's [z, x, y]. The intersection is {x, y} but A will pick y since its preference order is [y, x] while B picks x (preference order [x, y]).

In general it seems that we need to specify a deterministic pickVersion() function in the ICS.
e.g. if there is more than one common version and preference order is different, then pick the highest lexicographic one, otherwise the highest preference one.

Ruminating on this ... the issue with adding an extra step to the handshake is that it will just result in a hung handshake (no progress) if versions are disagreed upon, since there is no way to reconcile them...

could have a pickVersions() returning a list of versions and the extra step would reconcile (e.g. between [y, x] on A with [x, y] on B in the example^)

@cwgoes
Copy link
Contributor

cwgoes commented Aug 11, 2020

Yes; I think that would work, we just need to be careful that the relayer can't choose a different version, so both sides would need to enforce that pickVersion() was computed correctly. Good idea.

@cwgoes
Copy link
Contributor

cwgoes commented Aug 11, 2020

Maybe we can get this without an extra handshake step.

  1. (OpenInit) Initiating chain stores its list of acceptable versions
  2. (OpenTry) Receiving chain verifies initiating chain's list, stores its own list of acceptable versions
  3. (OpenAck) Initiating chain picks unique version from intersection using known ranking function
  4. (OpenConfirm) Receiving chain picks unique version from intersection using same ranking function

Both lists are verified, and if the same pickVersion function is used (3) and (4) should end up with the same version.

@cwgoes
Copy link
Contributor

cwgoes commented Aug 11, 2020

... also, we should clearly articulate the symmetry of the handshake, ensure that it holds, and verify this property, it is quite useful.

@ancazamfir
Copy link
Contributor

(OpenInit) Initiating chain stores its list of acceptable versions
(OpenTry) Receiving chain verifies initiating chain's list, stores its own list of acceptable versions
(OpenAck) Initiating chain picks unique version from intersection using known ranking function
(OpenConfirm) Receiving chain picks unique version from intersection using same ranking function

In the symmetric case there is no OpenConfirm. If one chain doesn't implement the standard pickVersion I believe we can still end up with opened connection with different versions.

Why can't we do the "chain picks unique version from intersection..." on OpenTry (similar to how it's specified today I believe), something like:

  1. (OpenInit) Initiating chain stores its list of acceptable versions.
  2. (OpenTry) Receiving chain picks unique version from intersection using known ranking function. This version is stored in the connectionEnd.
  3. (OpenAck) Initiating chain verifies that the counterparty version is included in its list of acceptable versions. This version is stored in the connectionEnd. If not, connection stays in INIT or TRYOPEN state.
  4. (OpenConfirm) Receiving chain verifies the counterparty version in the message is the same as the one stored in its connectionEnd. If not, connection stays in TRYOPEN state.

... also, we should clearly articulate the symmetry of the handshake, ensure that it holds, and verify this property, it is quite useful.

agreed, hopefully we get a model for this (checking different solutions) soon.

@cwgoes
Copy link
Contributor

cwgoes commented Aug 13, 2020

In the symmetric case there is no OpenConfirm. If one chain doesn't implement the standard pickVersion I believe we can still end up with opened connection with different versions.

Yes, but that's fine because the second OpenAck behaves the same way (calls pickVersion, and the two lists of versions are necessarily the same because they were cross-verified in the first two steps).

Why can't we do the "chain picks unique version from intersection..." on OpenTry (similar to how it's specified today I believe), something like:

Hmm, yes, that might work as well, we need to ensure that if both sides run steps 2 and 3 they still end up with the same version, but that should be possible to ensure by constraining the ranking function, I think. I think this logic is similar to what I proposed except that mutual cross-verification of version lists doesn't need to happen in the expected case, only the crossing-hellos case - so this is a bit more efficient, although slightly trickier to reason about.

agreed, hopefully we get a model for this (checking different solutions) soon.

💯

@adizere
Copy link
Contributor Author

adizere commented Aug 25, 2020

I like both solutions!

Hmm, yes, that might work as well, we need to ensure that if both sides run steps 2 and 3 they still end up with the same version, but that should be possible to ensure by constraining the ranking function, I think. I think this logic is similar to what I proposed except that mutual cross-verification of version lists doesn't need to happen in the expected case, only the crossing-hellos case - so this is a bit more efficient, although slightly trickier to reason about.

Agree. One logic is slightly simpler to understand, the other more to the point. Hopefully model checking will give us more certainty about their correctness.

(OpenAck) Initiating chain verifies that the counterparty version is included in its list of acceptable versions. This version is stored in the connectionEnd. If not, connection stays in INIT or TRYOPEN state.

Regarding this last part "If not, connection stays in INIT or TRYOPEN state," it will be useful to clarify that if versions do not intersect, then connection handshake hangs -- it's impossible to ensure liveness. So one of the assumptions of ICS3 that we should state clearly is that the sets of versions that the two handshaking chains provide (return values of getCompatibleVersions) should have a non-empty intersection; this is something to be added to ICS3#versioning.

@adizere adizere changed the title ICS3 potential problem in connOpenTry ICS3 handshake version negotiation problem Aug 28, 2020
@adizere
Copy link
Contributor Author

adizere commented Aug 28, 2020

PR ibc-rs/#211 documents our exploration of the solution candidates to fix this problem. In particular:

  • the solution candidates are described in this readme and
  • the disclosure log now includes traces for all the problems we discussed, including for the solution candidates that are incorrect.

Next steps:

  1. discuss (sync and async) PR ibc-rs/#211;
  2. decide on a solution: either of the candidates "onTryDet" or "onAckDet" are correct, but the former is probably a more appropriate fix for an implementation -- to be discussed;
  3. open PR in cosmos/ics to fix ICS3 algorithms,
  4. perhaps close the loop by revisiting the TLA+ specs to ensure alignment with the fixed ICS3 algorithms.

@adizere
Copy link
Contributor Author

adizere commented Sep 1, 2020

... also, we should clearly articulate the symmetry of the handshake, ensure that it holds, and verify this property, it is quite useful. (@cwgoes)

The most promising formulation of protocol symmetry is as follows:

In the connection handshake, any party may attempt to take a protocol step (i.e., process a datagram from a relayer) based on any -- possibly outdated -- state of the other party.

Not sure how clear this is. But it should capture the intuition that:

  • given absolutely any state that party Alice is in at some point,
  • a relayer may construct a datagram on that state, and
  • party Bob may process that datagram (regardless of Bob's current state)
  • (and vice versa for direction Bob -> Alice)

@cwgoes
Copy link
Contributor

cwgoes commented Sep 1, 2020

In the connection handshake, any party may attempt to take a protocol step (i.e., process a datagram from a relayer) based on any -- possibly outdated -- state of the other party.

Interesting - this is not the sort of symmetry I had in mind, but it probably is a useful notion!

The symmetry I had in mind is a symmetry between chains A and B, such that subsequent steps of a handshake "mirror" each other and A and B each perform the same computations and verify the same data on the other side - this is not a required property (and it is violated by "onTryDet", which is probably the handshake option we want to enact), but it is nice because it makes things simpler to reason about. For example, the version picking procedure in "onAckDet" is symmetric, since both chains verify the set of supported versions on the other chain and then run pickVersion - so we only have to reason about one "logical execution flow", as it were - but in this case, "onAckDet" has the disadvantage of resulting in potentially disagreeing versions if pickVersion is incorrectly implemented, which is (arguably) worse than the handshake failing as it would with "onTryDet".

@adizere
Copy link
Contributor Author

adizere commented Sep 2, 2020

The symmetry I had in mind is a symmetry between chains A and B, such that subsequent steps of a handshake "mirror" each other and A and B each perform the same computations and verify the same data on the other side - this is not a required property .. but it is nice because it makes things simpler to reason about.

I see, neat! Not sure how to reconcile the fact that this symmetry property is "useful" and yet "violated" (by the "onTryDet" mode). The property I introduced above is satisfied by "onAckDet" and "onTryDet" but not as useful as this one... I guess we just leave this issue about symmetry in the backlog for the moment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Backlog
Development

Successfully merging a pull request may close this issue.

3 participants