prospective-parachains: allow chained candidates coming out-of-order #3541

Closed
Tracked by #1829
alindima opened this issue Mar 1, 2024 · 16 comments · Fixed by #4035

@alindima
Contributor

alindima commented Mar 1, 2024

Assume latest backed candidate of a parachain is A.
With elastic-scaling enabled, one collator builds candidate B and another collator candidate C.

There's currently no restriction on the order in which these collations will be advertised and supplied to the soon-to-be block author. If prospective-parachains learns of candidate C before candidate B, candidate C will be discarded.

For elastic-scaling to properly work, we need to allow some number of disjoint trees to form, even if they don't build directly on top of the latest backed candidate.

The alternative is collator cooperation and hoping that network latency/availability does not defeat that.

@alindima alindima added I5-enhancement An additional feature request. T8-polkadot This PR/Issue is related to/affects the Polkadot network. labels Mar 1, 2024
@sandreim
Contributor

sandreim commented Mar 1, 2024

Sounds like it would create additional complexity in prospective-parachains or the collator client. Alternatively, we could buffer them in candidate-backing and batch-send them once we have the proper parent, but I'm not sure if this is better.

@alindima
Contributor Author

alindima commented Mar 1, 2024

Sounds like it would create additional complexity in prospective-parachains or the collator client. Alternatively, we could buffer them in candidate-backing and batch-send them once we have the proper parent, but I'm not sure if this is better.

That's a good idea. candidate-backing is less complex than prospective-parachains, so adding this logic wouldn't hurt as badly.

@burdges

burdges commented Mar 2, 2024

With elastic-scaling enabled, one collator builds candidate B and another collator candidate C.

Although we never liked the restriction, we envisioned that initially B and C must come from the same collator.

If prospective-parachains learns of candidate C before candidate B, candidate C will be discarded.

We could have candidate C provide a Merkle proof into the header of B, or the whole header of B.

If the parachain uses sufficiently parallel algorithms, then yes, B and C could be built without knowledge of one another. We could consider this "tree" scaling, instead of "elastic" scaling. An approach goes: parachains have short forks of "helper" blocks, maybe based off some common "master" parachain parent. A future "master" parachain block could reintegrate these forks by proving their state changes mesh correctly. All this feels like "future work" though.

@eskimor
Member

eskimor commented Mar 2, 2024

Although we never liked the restriction, we envisioned that initially B and C must come from the same collator.

That does not change anything, because that same collator would still send the collations to different backing groups.

We just need to be more flexible and accept an "unconnected" fork, while being careful not to open up a spam vector.

@eskimor
Member

eskimor commented Mar 14, 2024

Given the added complexity due to elastic scaling (and also more potential spam vectors), it would be good if we could drop support for parachain forks.

Note we are not talking about relay chain forks here, it is clear that there will be a parachain fork for each relay chain fork, we are not concerned about these, only about multiple parachain forks for a single relay chain fork.

We are pretty sure that

(a) parachains have no good reason to fork
(b) that they are also not doing it in practice

(b) would be good to have verified, hence I would suggest adding a metric to prospective parachains recording the number of forks we see in the wild for parachains.

The result should be improved validator performance, less complexity and fewer spam vectors, at the cost of forky parachains taking a performance hit.

Thinking about this some more, actually I don't think we need metrics here:

  1. Without asynchronous backing, forks don't matter. We just pick one anyway and they are not able to build upon each other.
  2. With asynchronous backing, if we dropped the "wrong" fork, then we might invalidate a complete prospective chain that is currently being built. In that case the parachain would actually take a performance hit. But we have not launched asynchronous backing yet, and if this turns out to be an issue, we should fix it on the parachain side rather than waste validator resources resolving an issue at the wrong level.

@eskimor
Member

eskimor commented Mar 14, 2024

Not allowing forks does not yet solve the problem of "unconnected" collations, though. If they are not connected, they could be anything. Nodes which are not even collators could just provide us old collations they found, and we would have to believe this is an unconnected fork and accept it.

To resolve this, I would suggest introducing, for the MVP, the notion of extended validation:

  1. We accept unconnected collations and validate them with regard to the provided head data, but we delay distribution of statements until we have received statements for the missing links up to the last included candidate.
  2. Then we distribute our statements.

The idea is that the validation of a candidate is only considered successful once we have seen statements for its dependencies. This still allows for parallel validation of candidates, but sequences the results ... adding a bit of delay, which should not matter much with asynchronous backing.

Consequences: For the second and third cores in elastic scaling there will be no PoV distribution between validators. The collator needs to provide the collation to sufficiently many validators itself.

Reasoning

If we consider having the full chain as part of a successful validation, then elastic scaling does not worsen the situation for validators. Yes, nodes can mess with us, but they could equally well just provide an invalid collation. It is arguably easier to provide invalid, resource-consuming collations now, but by making the completion of the chain part of the validation, the issue stays isolated on the validator and it can react to garbage candidates by dropping the reputation of the peer; especially with #616, validators will have proper means for that.

If we instead distributed statements earlier, then garbage candidates would no longer be an isolated problem: other validators would already have invested work which then needs to be discarded, and those validators could not meaningfully "punish" anybody.

Outlook

With proper dependency tracking we could distribute statements early, iff the dependency (CandidateHash of parent) was not only part of the descriptor, but part of the CandidateCommitments. Then the collator can simply provide us the missing descriptors and we can verify the chain - knowing it is valid, if our candidate is valid. This improves the situation dramatically, as then we can be sure that the collation is fully valid and not e.g. some historic one. The only remaining attack is obviously to never provide those dependencies.
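
A minimal sketch of that chain check, assuming (hypothetically) that each candidate commits to its parent's `CandidateHash`; the types below are illustrative placeholders, not the actual polkadot-primitives definitions:

```rust
// Illustrative placeholders only - not the real polkadot-primitives types.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct CandidateHash([u8; 32]);

struct ProvidedDescriptor {
    /// Hash of this candidate.
    hash: CandidateHash,
    /// Parent candidate hash, assumed here to be part of the commitments.
    parent: CandidateHash,
}

/// Check that the descriptors provided by the collator form an unbroken chain
/// from the last included candidate up to the candidate we just validated.
fn forms_chain(last_included: CandidateHash, descriptors: &[ProvidedDescriptor]) -> bool {
    let mut expected_parent = last_included;
    for d in descriptors {
        if d.parent != expected_parent {
            return false;
        }
        expected_parent = d.hash;
    }
    true
}

fn main() {
    let (a, b, c) = (CandidateHash([1; 32]), CandidateHash([2; 32]), CandidateHash([3; 32]));
    let chain = vec![
        ProvidedDescriptor { hash: b, parent: a },
        ProvidedDescriptor { hash: c, parent: b },
    ];
    assert!(forms_chain(a, &chain));
}
```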

We could make it so that those unconnected collations can still be put into a block: They won't be enacted, but the backers could still get rewards. Thus a misbehaving collator would again only harm the parachain (and it has to be a collator, as the provided collation was valid), and we can therefore again leave taking care of those nasty nodes to the parachain.

Implementation

  1. We check for dependencies being present; until they are, we don't yet fully consider a candidate valid - no notifications will be sent. We can, for maximal streamlining, already sign the statements, but we need to buffer them until we have seen the dependencies (see the sketch after this list).
  2. For maximum throughput we need some notification on incoming dependencies. Could be registering a oneshot sender in prospective parachains.
  3. We need an additional timeout, similar to the normal validation: If we don't receive the dependencies in a reasonable timeframe, we drop the statements. A reasonable timeout would be in the ballpark of a slot, i.e. <= 6s.
  4. Raciness: Even if we only distribute after we have received statements for the dependencies, this does not mean that all other nodes will also receive them in order. I would assume that this is handled by statement-distribution, but I have not checked.
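
A rough sketch of steps 1 and 3 above, using hypothetical names rather than the actual candidate-backing types:

```rust
// Hypothetical sketch: sign statements early, hold them back until the
// dependency has been seen, and drop them if the dependency never arrives.
use std::time::{Duration, Instant};

type CandidateHash = [u8; 32];

struct PendingStatement {
    statement: Vec<u8>,        // an already-signed statement, not yet distributed
    depends_on: CandidateHash, // parent candidate we still need statements for
    buffered_at: Instant,
}

struct StatementBuffer {
    pending: Vec<PendingStatement>,
    timeout: Duration, // ballpark of a slot, e.g. 6s
}

impl StatementBuffer {
    fn new(timeout: Duration) -> Self {
        Self { pending: Vec::new(), timeout }
    }

    /// Step 1: sign for maximal streamlining, but buffer instead of distributing.
    fn buffer(&mut self, statement: Vec<u8>, depends_on: CandidateHash) {
        self.pending.push(PendingStatement { statement, depends_on, buffered_at: Instant::now() });
    }

    /// Step 2: when we see statements for `dependency`, release everything that
    /// was waiting on it so it can finally be distributed.
    fn on_dependency_seen(&mut self, dependency: CandidateHash) -> Vec<Vec<u8>> {
        let mut ready = Vec::new();
        self.pending.retain(|p| {
            if p.depends_on == dependency {
                ready.push(p.statement.clone());
                false
            } else {
                true
            }
        });
        ready
    }

    /// Step 3: drop statements whose dependencies never arrived in time.
    fn prune_expired(&mut self) {
        let timeout = self.timeout;
        self.pending.retain(|p| p.buffered_at.elapsed() < timeout);
    }
}
```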

@ordian
Member

ordian commented Mar 15, 2024

This should work: if the current validation succeeds, then for any actual blockchain, making these intermediate head data up should not be possible. If the chain were not an actual blockchain and making up that data were possible, then the validators are still fine: we assume it cannot be made up; if it can, it is a liveness problem of the parachain that can be fixed by the parachain itself (e.g. just be a proper chain).

Not sure I follow which problem we're trying to solve. Malicious collators trying to waste validators' resources? They can always add invalid parent head data and never reveal the grandparent candidate. But they could also just send an invalid candidate and create a new identity after we disconnect them. I guess it's a question of which candidates/collators we prioritize. Not sure if sending the full chain of head data helps here though. Could you elaborate on how it helps exactly?

What about parachains that genuinely have forks, for example PoW parachains? In addition to that, do we want to support non-blockchain based parachains in the future, e.g. DAG-based "chains"?

@eskimor
Member

eskimor commented Mar 15, 2024

Not sure I follow which problem we're trying to solve. Malicious collators trying to waste validators' resources? They can always add invalid parent head data and never reveal the grandparent candidate. But they could also just send an invalid candidate and create a new identity after we disconnect them. I guess it's a question of which candidates/collators we prioritize. Not sure if sending the full chain of head data helps here though. Could you elaborate on how it helps exactly?

Completely reworked the comment. For the new identity #616 should help.

What about parachains that genuinely have forks, for example PoW parachains? In addition to that, do we want to support non-blockchain based parachains in the future, e.g. DAG-based "chains"?

I don't think that PoW parachains make any sense, and if they exist they would have poor performance anyway. All that ignoring forks does is reduce the parachain's performance if there are forks ... also, 6s block times for a proof-of-work parachain seem unrealistic anyway.

We force them to be a chain ... for all intents and purposes. This does not limit flexibility in any way. In particular, for the final version all we need is that candidates form a chain ... given that candidates include the relay parent, this is almost guaranteed just because of this requirement. ParaDags could e.g. add a sequence number to ensure this in all cases (for candidates based on the same relay parent).

@burdges

burdges commented Mar 15, 2024

PoW parachains could be ignored for many reasons, including them being too slow for elastic scaling, but more because they cannot really contact backing groups or do off-chain messaging in sensible ways.

We accept unconnected collations and validate them with regards to the provided head data, but we delay distribution of statements until we received statements for the missing links up to the last included candidate.

There's an issue that backing groups share the candidates internally first, and only later with other validators, so you want to avoid this internal sharing for spam reasons; but if your solution works, then why not make collators do this anyway? It's worse latency with slow collators, I guess?

We've a race for reporting attestations on-chain under your proposal: We've many groups that've validated candidates, but none of them know if they can gossip their candidates. We need maybe 1/4 second before each being gossiped triggers the next to be gossiped, so we'll only manage 4 elastic-scaling candidates per extra second after the validation time.

They won't be enacted, but the backers could still get rewards

We should never pay backers until finality anyways, when we'd pay approval checkers too, but I suppose internally valid candidates would be approved, even though they never fit into the parachain. That's fine.

We force them to be a chain ... for our intents or purposes

I do think there is some value in approving dangling candidates that never fit into any chain. We do not have to permit this for elastic scaling, but it really does provide a gateway for parachains where the collators have not really figured everything out themselves yet.

As I've said elsewhere, I'd favor an "asynchronous accumulation" replacement for the accumulate function in the core jam proposal. It'd just permit these dangling "helper" blocks, which then future parachain blocks could integrate into the parachain using the parachain's own logic, not bound by the execution time of code on the relay chain.

At a high level, the big trade-off for dangling blocks is that we're limiting how parachains can handle their own economics: VCs, etc. cannot so easily subsidise collators. It's maybe fine if elastic scaling works with VC funding, but then corejam or whatever winds up much less friendly to VC funding. I dunno...

Ideas:

If headers were hashed like H(prefix_including_parent_hash, H(rest_of_header)) then we can share prefix_including_parent_hash more cheaply, without worrying about all the header bloat we've unfortunately accumulated in rest_of_header.
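
A toy illustration of that two-level layout; `DefaultHasher` here is just a stand-in for a real cryptographic hash, and the field split is hypothetical:

```rust
// Toy sketch of H(prefix_including_parent_hash, H(rest_of_header)).
// DefaultHasher stands in for a real cryptographic hash function.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn hash_bytes(data: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    data.hash(&mut h);
    h.finish()
}

struct Header {
    prefix_including_parent_hash: Vec<u8>, // parent hash plus a small prefix
    rest_of_header: Vec<u8>,               // the bulky remainder
}

impl Header {
    /// H(prefix_including_parent_hash, H(rest_of_header))
    fn hash(&self) -> u64 {
        Self::hash_from_parts(
            &self.prefix_including_parent_hash,
            hash_bytes(&self.rest_of_header),
        )
    }

    /// Anyone holding only the prefix and H(rest_of_header) can recompute the
    /// full header hash without ever downloading rest_of_header.
    fn hash_from_parts(prefix: &[u8], rest_hash: u64) -> u64 {
        let mut combined = prefix.to_vec();
        combined.extend_from_slice(&rest_hash.to_le_bytes());
        hash_bytes(&combined)
    }
}

fn main() {
    let h = Header { prefix_including_parent_hash: vec![1, 2, 3], rest_of_header: vec![4; 1024] };
    let rest_hash = hash_bytes(&h.rest_of_header);
    assert_eq!(h.hash(), Header::hash_from_parts(&h.prefix_including_parent_hash, rest_hash));
}
```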

I've not followed coretime closely enough, but can we now be sure that the relay chain gets paid for blocks which get approved but never fit into some chain? At least this must surely be doable.

@sandreim
Contributor

Not sure I follow which problem we're trying to solve. Malicious collators trying to waste validators' resources? They can always add invalid parent head data and never reveal the grandparent candidate. But they could also just send an invalid candidate and create a new identity after we disconnect them. I guess it's a question of which candidates/collators we prioritize. Not sure if sending the full chain of head data helps here though. Could you elaborate on how it helps exactly?

Completely reworked the comment. For the new identity #616 should help.

What about parachains that genuinely have forks, for example PoW parachains? In addition to that, do we want to support non-blockchain based parachains in the future, e.g. DAG-based "chains"?

I don't think that PoW parachains make any sense, and if they exist they would have poor performance anyway. All that ignoring forks does is reduce the parachain's performance if there are forks ... also, 6s block times for a proof-of-work parachain seem unrealistic anyway.

We force them to be a chain ... for all intents and purposes. This does not limit flexibility in any way. In particular, for the final version all we need is that candidates form a chain ... given that candidates include the relay parent, this is almost guaranteed just because of this requirement. ParaDags could e.g. add a sequence number to ensure this in all cases (for candidates based on the same relay parent).

I'd like to propose something in between. Let's accept a bounded number of forks per depth but at the same time we disallow cycles and force DAGs to include some nonce. This should remove complexity and make the prospective-parachains subsystem interface more straightforward, as we can't have the same candidate at multiple depths and/or relay parents. Parachains which are too forkful take a performance hit ("too" is something we can be lenient about if there is a real need, but we should be very strict at first), and this is fine as discussed above.

That being said, to tackle the issue at hand, I'd go for a simpler approach in this lighter prospective-parachains subsystem by introducing a new mapping `unconnected_candidates_by_parent_head: HashMap<ParaHeadHash, HashSet<CandidateHash>>`. We'd prune elements from this mapping as candidates get connected into the fragment tree or go out of view.
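
A minimal sketch of that mapping and its pruning, with hypothetical stand-ins for the real hash types:

```rust
// Hypothetical stand-ins for the real polkadot-primitives hash types.
use std::collections::{HashMap, HashSet};

type ParaHeadHash = [u8; 32];
type CandidateHash = [u8; 32];

#[derive(Default)]
struct UnconnectedCandidates {
    by_parent_head: HashMap<ParaHeadHash, HashSet<CandidateHash>>,
}

impl UnconnectedCandidates {
    /// Remember a candidate whose parent is not (yet) part of the fragment tree.
    fn add(&mut self, parent_head: ParaHeadHash, candidate: CandidateHash) {
        self.by_parent_head.entry(parent_head).or_default().insert(candidate);
    }

    /// When the candidate producing `connected_head` joins the fragment tree,
    /// its children stop being "unconnected" and can be attached.
    fn take_connected(&mut self, connected_head: &ParaHeadHash) -> HashSet<CandidateHash> {
        self.by_parent_head.remove(connected_head).unwrap_or_default()
    }

    /// Drop candidates that went out of view (e.g. relay parent too old).
    fn prune_out_of_view(&mut self, still_in_view: impl Fn(&CandidateHash) -> bool) {
        for candidates in self.by_parent_head.values_mut() {
            candidates.retain(|c| still_in_view(c));
        }
        self.by_parent_head.retain(|_, set| !set.is_empty());
    }
}
```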

WDYT ?

@eskimor
Member

eskimor commented Mar 18, 2024

Just realized: there is no risk of random nodes messing with us with historic candidates, as we are still limiting the allowed relay parents. So all valid candidates, even unconnected ones, must have been produced recently by an actual collator.

@alindima
Contributor Author

Let's accept a bounded number of forks per depth but at the same time we disallow cycles and force DAGs to include some nonce.

How could we even enforce that we disallow cycles? Just ignore them and hope that they won't blast out into bricking the relay chain?
I don't think we have feasible alternatives.

@sandreim
Contributor

sandreim commented Mar 25, 2024

Yeah, IMO for the general case we really cannot, but for what is in scope of the prospective parachains at any point in time we could:

  • not allow same candidate at multiple depths
  • enforce no duplicate para heads in chained candidates we pass to runtime

Additionally, it might be good to only allow the introduction of a candidate to the fragment tree once.

@eskimor
Member

eskimor commented Mar 25, 2024

We cannot disallow them right now, but we don't need to provide a guarantee that a parachain having cycles works. So we can drop support in that sense (if it gains us something significant); we still need validators not to crash if people do it anyway ;-)

@eskimor
Member

eskimor commented Mar 25, 2024

Yeah, IMO for the general case we really cannot, but for what is in scope of the prospective parachains at any point in time we could:

* not allow same candidate at multiple depths

* enforce no duplicate para heads in chained candidates we pass to runtime

Additionally, it might be good to only allow the introduction of a candidate to the fragment tree once.

Sounds reasonable. Just dropping duplicates for whatever we have in view seems sensible, if it makes things easier.

@alindima
Contributor Author

alindima commented Apr 3, 2024

I have some updates from my prospective-parachains immersion.

As this issue says, we've been under the impression that, if candidate B (which builds on top of candidate A) is supplied to a backing group before the group finds out (via statement-distribution) about candidate A, candidate B will be discarded.
According to the logic in prospective-parachains, this is correct.

Based on this, I've been rewriting prospective-parachains to allow for a number of "orphan" candidates (still a draft).
While doing this, I simplified it by not allowing parachain forks and cycles.

I just discovered though that this is already somewhat tackled in collator-protocol.
On the validator side, before fetching an advertised collation, it asks prospective-parachains whether this new candidate can be seconded (i.e. whether it builds on a leaf of the fragment tree).
If it can't, the collation won't be fetched yet. It will be buffered and retried on a leaf update or when another candidate is backed.
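
Rough illustration of that buffer-and-retry behaviour (hypothetical names, not the actual collator-protocol code):

```rust
// Hypothetical sketch: advertisements that cannot be seconded yet are parked
// and retried whenever the view changes or another candidate gets backed.
use std::collections::VecDeque;

type Advertisement = u64; // stand-in for an advertised collation

#[derive(Default)]
struct BufferedAdvertisements {
    waiting: VecDeque<Advertisement>,
}

impl BufferedAdvertisements {
    /// Called when prospective-parachains says the candidate cannot be seconded
    /// yet (its parent is not a leaf of the fragment tree).
    fn defer(&mut self, adv: Advertisement) {
        self.waiting.push_back(adv);
    }

    /// Called on a leaf update or when another candidate is backed: returns the
    /// advertisements that can now be fetched, keeping the rest parked.
    fn retry(&mut self, can_second: impl Fn(&Advertisement) -> bool) -> Vec<Advertisement> {
        let mut ready = Vec::new();
        self.waiting.retain(|adv| {
            if can_second(adv) {
                ready.push(*adv);
                false
            } else {
                true
            }
        });
        ready
    }
}
```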

This fixes the problem but hurts pipelining. Candidate B will not be validated and backed until Candidate A is backed and the statements reach the other backing group.

We can therefore get on with testing on a testnet and see how big of an issue this is. In the meantime, I'll get on with the changes.

In local zombienet tests, this is not an issue yet because the network is optimal and validation times are low.

@alindima alindima linked a pull request Apr 12, 2024 that will close this issue
github-merge-queue bot pushed a commit that referenced this issue May 13, 2024
Reworks prospective-parachains so that we allow a number of unconnected
candidates (for which we don't know the parent candidate yet). Needed
for elastic scaling:
#3541. Without this,
candidate B will not be validated and backed until candidate A (its
parent) is validated and a backing statement reaches the validator.

Due to the high complexity of the subsystem, I rewrote parts of it so
that we don't concern ourselves with candidates which form cycles or
which form parachain forks. We now have "Fragment chains" instead of
"Fragment trees". This greatly simplifies some of the code and is a
compromise we can make. We just need to make sure that cycle-producing
parachains don't brick the relay chain and that fork-producing
parachains can still make some progress (on one core at least). The only
forks that are allowed are those on the relay chain, obviously.

Unconnected candidates are kept in the `CandidateStorage` and whenever a
new candidate is introduced, we try to repopulate the chain with as many
candidates as we can.

Also fixes #3219

Guide changes will be done as part of:
#3699

TODOs:

- [x] see if we can replace the `Cow` over the candidate commitments
with an `Arc` over the entire `ProspectiveCandidate`. It's only being
overwritten in unit tests. We can work around that.
- [x] finish fragment_chain unit tests
- [x] add more prospective-parachains subsystem tests
- [x] test with zombienet what happens if a parachain is creating cycles
(it should not brick the relay chain).
- [x] test with zombienet a parachain that is creating forks. it should
keep producing blocks from time to time (one bad collator should not DOS
the parachain, even if throughput decreases)
- [x] add some more logs and metrics
- [x] add prdoc and remove the "silent" label

---------

Signed-off-by: Andrei Sandu <andrei-mihail@parity.io>
Co-authored-by: Andrei Sandu <andrei-mihail@parity.io>
hitchhooker pushed a commit to ibp-network/polkadot-sdk that referenced this issue Jun 5, 2024
liuchengxu pushed a commit to liuchengxu/polkadot-sdk that referenced this issue Jun 19, 2024
TarekkMA pushed a commit to moonbeam-foundation/polkadot-sdk that referenced this issue Aug 2, 2024