
eth/catalyst: move block commit into its own go-routine to avoid deadlock #29657

Closed · wants to merge 6 commits

Conversation

jwasinger (Contributor)

Closes #29475. The provided test case captures the issue and fails without the fix.

jwasinger requested a review from gballet as a code owner on April 26, 2024.
holiman (Contributor) commented Apr 26, 2024

Looking into this lockup a bit, I was wondering why the sealBlock -> forkChoiceUpdated -> pool.Sync() call chain was happening, and I found this comment:

		// If the beacon chain is ran by a simulator, then transaction insertion,
		// block insertion and block production will happen without any timing
		// delay between them. This will cause flaky simulator executions due to
		// the transaction pool running its internal reset operation on a back-
		// ground thread. To avoid the racey behavior - in simulator mode - the
		// pool will be explicitly blocked on its reset before continuing to the
		// block production below.
		if simulatorMode {
			if err := api.eth.TxPool().Sync(); err != nil {
				log.Error("Failed to sync transaction pool", "err", err)
				return valid(nil), engine.InvalidPayloadAttributes.With(err)
			}
		}

It seems to me that the spinoff goroutines added in this PR are basically a hack around something that was intentionally put there. I may be wrong; I haven't fully grokked the entire situation yet.

jwasinger (Contributor, Author)

I'm not sure there's a good way around this. It's clear that we need the call to pool.Sync when calling fcu in simulator mode, because we want the tests to be deterministic. However, when transactions are sent concurrently in zero-period mode, that sync inevitably triggers a NewTxsEvent, causing the deadlock. The toy model below shows the shape of the cycle.
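(Editor's note: a self-contained toy model of the deadlock shape; all names are illustrative stand-ins, not go-ethereum APIs. Running it makes the Go runtime report "all goroutines are asleep - deadlock!".)

    package main

    func main() {
        newTxs := make(chan int)         // stands in for the NewTxsEvent subscription (unbuffered)
        resetDone := make(chan struct{}) // stands in for pool.Sync waiting on the pool's reset

        // "txpool" side: announce a tx, then have the reset re-announce
        // pending txs before signalling completion.
        go func() {
            newTxs <- 1      // first tx arrives; the loop below picks it up
            newTxs <- 2      // the reset re-announces: blocks forever, see below
            close(resetDone) // never reached, so Sync never returns
        }()

        // "api.loop" side: the sole subscriber of newTxs.
        for range newTxs {
            // commit -> fcu -> pool.Sync(): wait for the reset to finish.
            // But the reset is blocked sending on newTxs, whose only
            // reader is sitting right here.
            <-resetDone
        }
    }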

Afaict, the two options we have here are:

  1. Institute this "hack" and keep the code footprint small, or
  2. remove the call to pool.Sync in fcu and change every test case that uses the simulated beacon to call pool.Sync manually after inserting txs (a sketch of this follows below).
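(Editor's note: for illustration, option 2 would give every simulated-beacon test a small helper along these lines; the helper name is hypothetical, and pool.Sync is the same call quoted above.)

    import (
        "testing"

        "github.com/ethereum/go-ethereum/core/txpool"
    )

    // waitForPool blocks on the pool's internal reset explicitly before a
    // test triggers block production, instead of relying on fcu to do so
    // in simulator mode.
    func waitForPool(t *testing.T, pool *txpool.TxPool) {
        t.Helper()
        if err := pool.Sync(); err != nil {
            t.Fatalf("failed to sync transaction pool: %v", err)
        }
    }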

fjl (Contributor) commented Jun 25, 2024

@jwasinger please look at this again

jwasinger (Contributor, Author) commented Jun 25, 2024

I spent the day looking into this, trying to come up with a solution that doesn't involve manually calling txpool Sync from fcu in zero-period dev mode, and realized that it seems unworkable without making large changes elsewhere in the codebase.

I also looked at intercepting the NewTxsEvent on a separate channel from the one read in api.loop, and feeding the notification back into the api.loop channel (in some magical non-blocking way involving a separate goroutine). I couldn't come up with a solution that I was sure was deadlock-free and that couldn't leave executable transactions/withdrawals dangling without ever being included.

Right now, the solution in this PR is the best I can think of.

I would like to improve it further by moving the txpool.Sync invocation out of fcu and into dev-mode-specific commit logic. I think this would improve the readability of fcu with no apparent downside; a rough sketch follows.
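(Editor's note: a rough sketch of the shape that refactor might take — illustrative only; commitWork and the exact signatures are assumptions, not actual geth code.)

    // commitWork is a hypothetical dev-mode commit path: it syncs the
    // pool itself before sealing, so forkChoiceUpdated no longer needs a
    // simulatorMode special case.
    func (c *SimulatedBeacon) commitWork(withdrawals []*types.Withdrawal) error {
        // Block on the pool's reset here, in dev-mode-only code...
        if err := c.eth.TxPool().Sync(); err != nil {
            return fmt.Errorf("failed to sync transaction pool: %w", err)
        }
        // ...then seal a block as before; fcu itself stays simulator-agnostic.
        return c.sealBlock(withdrawals, uint64(time.Now().Unix()))
    }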

Review comment on:

    newTxs = make(chan core.NewTxsEvent)
    sub = a.sim.eth.TxPool().SubscribeTransactions(newTxs, true)
    commitMu = sync.Mutex{}
Review comment (Contributor):

Just fyi, the canonical way to initialize a variable with its zero value is just to declare it, i.e. you can write

var (
     commitMu sync.Mutex
)

Initializing with a zero literal is bad style.

Review comment on:

    go func() {
        commitMu.Lock()
        defer commitMu.Unlock()
Review comment (Contributor):
Not sure if it matters, but this construction with the lock will not preserve the receive order of the select clauses. If both <-newTxs and <-a.sim.withdrawals.pending become ready around the same time, they will both spawn new goroutines in quick succession; those goroutines then race for the lock, and there is no guarantee that the goroutine spawned first acquires the lock first. The sketch below demonstrates this.
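(Editor's note: a minimal, self-contained demonstration of that non-determinism; illustrative only, and the print order varies from run to run.)

    package main

    import (
        "fmt"
        "sync"
    )

    func main() {
        var (
            mu sync.Mutex
            wg sync.WaitGroup
        )
        // Goroutines are spawned in a fixed order, but they acquire the
        // lock in whatever order the scheduler lets them run.
        for i := 0; i < 5; i++ {
            wg.Add(1)
            go func(id int) {
                defer wg.Done()
                mu.Lock()
                defer mu.Unlock()
                fmt.Println("commit from goroutine", id)
            }(i)
        }
        wg.Wait()
    }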

Reply (jwasinger, Author):

It should be fine: each goroutine eventually unblocks and commits, so every pending new-txs/withdrawals notification still triggers a commit.

jwasinger (Contributor, Author) commented Jun 28, 2024

There is a problem with the logic (one that is also present in master): we only call Commit once when responding to newTxs. This can leave executable txs dangling if they don't all fit in the block and no new txs subsequently become executable.

I looked into fixing this by altering the commit logic to commit in a loop until the miner returns an empty payload. However, this is nontrivial: by the time we are building a payload, we have already read the withdrawals from their channel, and we would somehow have to re-include them. A toy model of the loop idea follows below.
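(Editor's note: a self-contained toy of the loop idea, with the withdrawals wrinkle marked inline; all numbers and names are illustrative.)

    package main

    import "fmt"

    func main() {
        pending := 25         // executable txs waiting in the pool
        const blockSpace = 10 // txs that fit in one block

        withdrawals := []string{"w1", "w2"} // already read from their channel
        for {
            n := pending
            if n > blockSpace {
                n = blockSpace
            }
            if n == 0 && len(withdrawals) == 0 {
                break // the "empty payload" condition: nothing left dangling
            }
            fmt.Printf("commit block: %d txs, %d withdrawals\n", n, len(withdrawals))
            pending -= n
            // The hard part noted above: the withdrawals were consumed by
            // the first block, and there is no channel to re-read them from.
            withdrawals = nil
        }
    }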

Tbh, I'm not sure why the spamming test case included in this PR doesn't trigger this... I'll keep brainstorming today about how to fix it.

fjl (Contributor) commented Jun 28, 2024

Yeah, sounds like something that should be fixed.

jwasinger (Contributor, Author)

Closing in favor of #30264.

jwasinger closed this Aug 9, 2024
lightclient added a commit that referenced this pull request Aug 19, 2024
…30264)

closes #29475, replaces #29657, #30104 

This fixes two issues. First, a deadlock where the txpool attempts to reorg but can't complete because there are no readers left on the new-txs subscription. Second, a problem in on-demand mode where txs could be left pending when there were more pending txs than block space.

Co-authored-by: Martin Holst Swende <martin@swende.se>
Linked issue: Transactions stuck in pending in dev-mode (#29475)