feat:networking: (Synchronous) Consistent Broadcast for Filecoin EC #9858
Conversation
Oh whoa, I didn't even notice this until now! Thank you Alfonso for getting this in! @filecoin-project/lotus-tse, can one of you throw this branch on your miner and confirm win post stability? (AFTER WINTER BREAK)
itests/kit/init.go
Outdated
```go
// NOTE: If we want the payment channel itests to pass that use a
// block time of 5*millisecond, we need to set the consistent broadcast
// delay to something lower than that block time.
// todo: configure such a low delay only for paych tests.
build.CBDeliveryDelay = 2 * time.Millisecond
```
Can we fully disable this for most tests? The wdpost tests, for example, use a 2ms block time, and it would be good not to slow them down even more.
I completely disabled consistent broadcast for itests. That being said, it would be great if we could at least enable it on a smoke, consensus, or deals itest with a decent block time (if there is any), just as a sanity check that this doesn't break anything in the future.
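To illustrate that kind of opt-in, here is a minimal hedged sketch (a hypothetical helper, not the actual itests kit API; it only assumes the kit keeps toggling `build.CBDeliveryDelay`, as the itests already do):

```go
package kit

import (
	"time"

	"github.com/filecoin-project/lotus/build"
)

// WithConsistentBroadcastDelay is a hypothetical helper a single smoke or
// deals test could use to re-enable the consistent-broadcast delay, while
// the rest of the suite keeps it disabled. It returns a restore function
// for the caller to defer.
func WithConsistentBroadcastDelay(d time.Duration) (restore func()) {
	prev := build.CBDeliveryDelay
	build.CBDeliveryDelay = d
	return func() { build.CBDeliveryDelay = prev }
}
```

A test with a longer block time could then run `defer kit.WithConsistentBroadcastDelay(100 * time.Millisecond)()` to exercise the broadcast path without slowing down the rest of the suite.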
(meant to comment, not approve)
Broadly LGTM, a few questions / suggestions.
```go
type blksInfo struct {
	ctx    context.Context
	cancel context.CancelFunc
	blks   []cid.Cid
}
```
I'm unsure about the thread-safety of this `blks` slice. We seem to be reading & modifying it somewhat unsafely, and while it's probably the case that the system won't ever race here, I'm... not sure. Can we reason through this a bit?
I am using a `sync.Map` to prevent parallel reads/writes and avoid requiring a specific lock for every `blksInfo`, which may be harder to handle. I treat `blksInfo` as a single unit of data and never read the inner block data structure directly; hence the `load` and `store` methods.

That being said, I may have missed some subtle detail. Let me know if this makes sense.
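For readers following along, a minimal sketch of the load/store pattern being described (illustrative names, not the exact PR code; `blksInfo` is the struct shown above):

```go
package bcast

import "sync"

// bcastDict wraps a sync.Map so that blksInfo values are always read and
// replaced as whole units, never mutated field-by-field from outside.
type bcastDict struct {
	m sync.Map // key (string) -> *blksInfo
}

func (d *bcastDict) load(key string) (*blksInfo, bool) {
	v, ok := d.m.Load(key)
	if !ok {
		return nil, false
	}
	return v.(*blksInfo), true
}

func (d *bcastDict) store(key string, bi *blksInfo) {
	d.m.Store(key, bi)
}
```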
So my concern is that we load the `blksInfo` out of the `sync.Map`, but then read and modify the `blks` field. I think if we have 2 calls to `RcvBlock`, they both load the same `bInfo` object (safely). After that, however, they're gonna be concurrently accessing the `blks` slice, which I don't think is safe (e.g. iterating over the `blks` field while it's also being modified).

I'm not 100% sure that I'm right, and even if so, I'm not sure whether simultaneous calls to `RcvBlock` are possible (need to 👀 more), but does that make sense?
Actually, looking at the code more closely, we don't ever modify it. So we're probably okay here.
I think we are OK on that, but you are right that there may be a race here: one goroutine may load the dict while another loads it in parallel, before the first one has been able to store any potential changes.
I think this is fixed by moving this lock to after the load statement on line 125. WDYT?
@adlrocha I don't think that's sufficient. I think it needs to be held until no more writing happens, which is the end of the method (as currently written). Consider the following situation:

- 2 blocks A and B come in with the same VRFProof.
- A gets the lock first, creates an empty BCastDict (line 119), doesn't find a match for the common `key` on line 125, drops the lock.
- B gets the lock, loads the empty BCastDict created by A, also doesn't find a match for the common `key` on line 125, drops the lock.
- A `store`s `blkACid` as the only entry corresponding to `key` on line 141, returns.
- B overwrites that, storing `blkBCid` as the only entry corresponding to `key` on line 141, returns.
- Both A and B succeed their respective calls to `WaitForDelivery`.
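To make the suggested fix concrete, a hedged sketch (illustrative, not the exact PR code): hold a single lock across the whole load/check/store sequence so the interleaving above cannot happen.

```go
package bcast

import (
	"sync"

	"github.com/ipfs/go-cid"
)

var (
	rcvLk   sync.Mutex
	rcvDict = map[string]*blksInfo{} // blksInfo as in the struct shown earlier
)

// RcvBlock holds rcvLk until the end of the method, after the final write,
// so two blocks arriving with the same key cannot both observe an empty
// entry and then silently overwrite each other.
func RcvBlock(key string, blk cid.Cid) {
	rcvLk.Lock()
	defer rcvLk.Unlock()

	bi, ok := rcvDict[key]
	if !ok {
		bi = &blksInfo{}
		rcvDict[key] = bi
	}
	// Safe under rcvLk: no concurrent RcvBlock can be reading or appending.
	bi.blks = append(bi.blks, blk)
}
```

With a coarse lock like this the sync.Map is no longer strictly necessary; the trade-off is that unrelated keys now contend on the same mutex.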
build/params_mainnet.go
Outdated
```go
// CBDeliveryDelay is the delay before delivery in the synchronous consistent broadcast.
// This determines the wait time for the detection of potential equivocations.
var CBDeliveryDelay = 6 * time.Second
```
- Does this impact block propagation? I guess no since pubsub checks don't touch this logic.
- Can we set this to a lower value while still getting the expected security benefits?
- My concern is that if we merge this now, without fixes for market cron, we'll push quite deep into danger territory on how long it takes to get through an epoch (propagation delay + CB delay + state compute, which can be 14s today, + winning post).
> Does this impact block propagation? I guess no since pubsub checks don't touch this logic.
It shouldn't, right? The only thing we are delaying is the delivery of the block to the syncer (i.e. the call to `InformNewBlock`), but this is done after the Gossipsub validation, so the propagation of the block should follow its course seamlessly.
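A hedged sketch of that flow (simplified, with illustrative function names rather than the exact lotus hooks; only `build.CBDeliveryDelay` is taken from this PR):

```go
package sub

import (
	"context"
	"time"

	"github.com/filecoin-project/lotus/build"
)

// scheduleDelivery defers the hand-off to the syncer (the InformNewBlock
// call discussed above) by the consistent-broadcast guard window. Pubsub
// validation and re-forwarding have already happened by this point, so
// block propagation is unaffected.
func scheduleDelivery(ctx context.Context, deliverToSyncer func()) {
	go func() {
		select {
		case <-time.After(build.CBDeliveryDelay):
			// Guard window elapsed with no equivocation observed: deliver.
			deliverToSyncer()
		case <-ctx.Done():
			// Delivery aborted (equivocation detected or shutting down).
		}
	}()
}
```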
> Can we set this to a lower value while still getting the expected security benefits?
This is a great point, and it could definitely be a problem. These 6 seconds are the theoretical time it takes for a block to be propagated to the whole network by Gossipsub. The assumption is that after this time there is a high probability that any equivocation has been propagated widely enough to be caught by a majority of miners. Lowering this value will impact the effectiveness of the scheme, but I don't know to what extent, or what the minimum threshold is. I know @jsoares and @guy.goren were working on getting some real data from mainnet with ProbeLab to see empirically what the optimal delay is. I'll let them chime in in case they have more info.
> Does this impact block propagation? I guess no since pubsub checks don't touch this logic.
As @adlrocha said, propagation shouldn't be affected. Anything else would negate the benefits of the patch: we want the guard time to be larger than the propagation time, which is easier the faster propagation is.
> Can we set this to a lower value while still getting the expected security benefits?
The value can be set lower. In fact, there is no "cliff" behaviour and the security gains can be approximated (if not formally, at least intuitively) as a logistic function centred on the average propagation delay: while you don't necessarily need to cover the 99th-percentile, too low thresholds would catch few attacks and you'd always want to at least cover the average case.
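To make that intuition concrete, one hedged way to write it (my notation, not taken from the FIP): the probability of catching an equivocation as a function of the chosen delay $d$ looks roughly like

$$P_{\text{catch}}(d) \approx \frac{1}{1 + e^{-k\,(d - \bar{t}_{\text{prop}})}}$$

where $\bar{t}_{\text{prop}}$ is the average block propagation delay and $k$ reflects how concentrated the propagation-time distribution is. Setting $d$ well below $\bar{t}_{\text{prop}}$ catches few equivocations, while the gains taper off once $d$ covers the upper percentiles.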
Alas, our only real-world data was collected from the bootstrapper nodes (via Sentinel). Our analysis of that data shows that the vast majority of blocks reach the bootstrappers in < 1s. However, there are a few weaknesses here, primarily that (i) we're trusting source timestamps, and (ii) bootstrapper nodes are not necessarily representative of the average node in terms of connectivity (both link & centrality, even mesh configuration).
In the meantime, ProbeLab is working on a network-wide GossipSub study that will yield a higher-confidence distribution for block propagation. Alas, that work is taking longer than we expected and there is still no time frame for the results. The plan was, therefore, to start with the 6s used as the assumption in the GossipSub security analysis, keep it user-configurable, and push better defaults once we have the data. I expect we'll land in the 1 to 2 second range, but don't hold me to that.
Note that here I'm more concerned about a block arriving late making SPs late with their own block.
Propagation delay is 10s, and state compute today takes up to 14s due to cron issues. With this patch, in the worst case where some set of SPs chooses to send blocks late, we'd have essentially no spare time for WindowPoSt, essentially cutting into the propagation delay for the next epoch, which AFAICT would be comparable to or worse than the epoch boundary.
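For a rough sense of that worst case (my arithmetic, using the numbers above and Filecoin's 30s epochs):

$$10\,\mathrm{s}\ (\text{propagation}) + 6\,\mathrm{s}\ (\text{CB delay}) + 14\,\mathrm{s}\ (\text{state compute}) = 30\,\mathrm{s}$$

i.e. the entire epoch, leaving no slack for proving and block production before cutting into the next epoch's propagation window.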
Right, that's a valid concern given we're right at the threshold. How high would you be comfortable going? Given the context above, I think we could conceivably adjust down a second or two to leave some buffer time -- and still reassess when the study results are available.
Co-authored-by: Aayush Rajasekaran <arajasek94@gmail.com>
So I think I share @magik6k's concern here -- this is unfortunate, but the good news is that we're expecting things to get better fairly soon. What would we think about shipping this with an artificially low value (2 seconds?) and running that until nv19? That gives time for any CRITICAL issues with the code to emerge (not that I expect any). Upon nv19, we can set this to the desired 6 seconds. We also probably want to lower the default propagation delay when this is set to 6 seconds, I think?
(note: I'm responding to this but I'm not the author of the FIP...) That seems reasonable to me. I'm not sure it will be that far off the final numbers. As for adjusting the cutoff, I think the propagation study will definitely provide good data to inform this. That said:
That's correct, but in the current implementation it extends the latest time at which block validation begins. Ideally we'd start validating those blocks "optimistically", and somehow retro-reject them if we see another within `CBDeliveryDelay`.
Indeed, and that's suggested in the FIP as a potential optimisation. It's unclear to me how much of a gain that represents in practice, and it may add a bit of complexity, but it's a possible direction if we are indeed time-constrained. My expectation is still that the final parameters will be very comfortable with respect to the available window.
@arajasek, all comments addressed. I changed the locking strategy to make it less fine-grained but easier to reason about (as discussed above). We could recover the fine-grained locking and make it more understandable by using https://github.com/gammazero/keymutex, but I didn't want to introduce any additional dependency. (For context, we used this dependency extensively for storetheindex, and it is written and maintained by Andrew.)
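For context on that trade-off, a hedged sketch of what the finer-grained alternative could look like (a generic hand-rolled per-key mutex, not the keymutex API):

```go
package bcast

import "sync"

// keyLocks hands out one mutex per broadcast key, so blocks for different
// miners/epochs don't contend on a single global lock. The cost is the
// extra bookkeeping (and eventual cleanup) of the per-key mutexes.
type keyLocks struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func newKeyLocks() *keyLocks {
	return &keyLocks{locks: make(map[string]*sync.Mutex)}
}

// lock acquires the mutex for key, creating it on first use, and returns a
// function the caller can defer to release it.
func (k *keyLocks) lock(key string) (unlock func()) {
	k.mu.Lock()
	l, ok := k.locks[key]
	if !ok {
		l = &sync.Mutex{}
		k.locks[key] = l
	}
	k.mu.Unlock()
	l.Lock()
	return l.Unlock
}
```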
Co-authored-by: Aayush Rajasekaran <arajasek94@gmail.com>
…ast" This reverts commit 8b2208f, reversing changes made to 2db6b12. Unfortunately, we've found multiple issues in this code that could lead to unbounded resource usage, block censorship, crashes, etc. Luckily, this code isn't actually needed by the main Filecoin chain, which relies on reporting consensus faults. So we can just try again later.
…ast" This reverts commit 8b2208f, reversing changes made to 2db6b12. Unfortunately, this is rather tricky code. We've found several issues so far and, while we've fixed a few, there are outstanding issues that would require complex fixes we don't have time to tackle right now. Luckily, this code isn't actually needed by the main Filecoin chain which relies on consensus fault reporting to handle equivocation. So we can just try again later.
Revert #9858 (consistent broadcast changes)
Related Issues
Implementation of filecoin-project/FIPs#527
Proposed Changes
This PR implements a synchronous consistent broadcast to detect potential consensus equivocations.
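For reviewers skimming the diff, a hedged sketch of the core mechanism (simplified and illustrative, not the exact code in this PR): blocks are keyed by their equivocation-relevant identity, delivery to the syncer waits out a guard window, and seeing two different block CIDs under the same key within that window aborts delivery.

```go
package bcast

import (
	"context"
	"sync"
	"time"

	"github.com/ipfs/go-cid"
)

type ConsistentBCast struct {
	mu    sync.Mutex
	delay time.Duration
	seen  map[string]cid.Cid
	abort map[string]context.CancelFunc
}

func NewConsistentBCast(delay time.Duration) *ConsistentBCast {
	return &ConsistentBCast{
		delay: delay,
		seen:  make(map[string]cid.Cid),
		abort: make(map[string]context.CancelFunc),
	}
}

// RcvBlock records a block under its key and returns a context that is
// cancelled if an equivocation (a different CID under the same key) shows up.
func (cb *ConsistentBCast) RcvBlock(key string, blk cid.Cid) context.Context {
	cb.mu.Lock()
	defer cb.mu.Unlock()

	if prev, ok := cb.seen[key]; ok && prev != blk {
		// Equivocation: abort the earlier block and reject this one outright.
		cb.abort[key]()
		ctx, cancel := context.WithCancel(context.Background())
		cancel()
		return ctx
	}
	ctx, cancel := context.WithCancel(context.Background())
	cb.seen[key] = blk
	cb.abort[key] = cancel
	return ctx
}

// WaitForDelivery holds the block for the guard window and fails if an
// equivocation was detected in the meantime.
func (cb *ConsistentBCast) WaitForDelivery(ctx context.Context) error {
	select {
	case <-time.After(cb.delay):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}
```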
Checklist
Before you mark the PR ready for review, please make sure that:
- The PR title is in the form of `<PR type>: <area>: <change being made>`, e.g. `fix: mempool: Introduce a cache for valid signatures`
  - `PR type`: fix, feat, build, chore, ci, docs, perf, refactor, revert, style, test
  - `area`: e.g. api, chain, state, market, mempool, multisig, networking, paych, proving, sealing, wallet, deps