
eth/catalyst: move block commit into its own go-routine to avoid deadlock #29657

Closed · wants to merge 6 commits

Conversation

jwasinger (Contributor)

Closes #29475. The provided test case captures the issue and fails without the fix.

jwasinger requested a review from gballet as a code owner on April 26, 2024.
holiman (Contributor) commented Apr 26, 2024

Looking into this lockup a bit, I was wondering why the sealBlock -> forkChoiceUpdated -> pool.Sync() call chain was happening, and I found this comment:

		// If the beacon chain is ran by a simulator, then transaction insertion,
		// block insertion and block production will happen without any timing
		// delay between them. This will cause flaky simulator executions due to
		// the transaction pool running its internal reset operation on a back-
		// ground thread. To avoid the racey behavior - in simulator mode - the
		// pool will be explicitly blocked on its reset before continuing to the
		// block production below.
		if simulatorMode {
			if err := api.eth.TxPool().Sync(); err != nil {
				log.Error("Failed to sync transaction pool", "err", err)
				return valid(nil), engine.InvalidPayloadAttributes.With(err)
			}
		}

It seems to me that the spinoff goroutines added in this PR are basically a hack around something that was intentionally put there. I may be wrong; I haven't fully grokked the entire situation yet.

jwasinger (Contributor, Author)

I'm not sure there's a good way around this. It's clear that we need the call to pool.Sync when calling fcu in simulator mode, because we want the tests to be deterministic. However, when transactions are sent concurrently in zero-period mode, that sync inevitably triggers a NewTxsEvent, causing the deadlock. The toy model below shows the shape of the cycle.
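(Editor's note: a self-contained toy model of the deadlock shape; all names are illustrative stand-ins, not go-ethereum APIs. Running it makes the Go runtime report "all goroutines are asleep - deadlock!".)

    package main

    func main() {
        newTxs := make(chan int)         // stands in for the NewTxsEvent subscription (unbuffered)
        resetDone := make(chan struct{}) // stands in for pool.Sync waiting on the pool's reset

        // "txpool" side: announce a tx, then have the reset re-announce
        // pending txs before signalling completion.
        go func() {
            newTxs <- 1      // first tx arrives; the loop below picks it up
            newTxs <- 2      // the reset re-announces: blocks forever, see below
            close(resetDone) // never reached, so Sync never returns
        }()

        // "api.loop" side: the sole subscriber of newTxs.
        for range newTxs {
            // commit -> fcu -> pool.Sync(): wait for the reset to finish.
            // But the reset is blocked sending on newTxs, whose only
            // reader is sitting right here.
            <-resetDone
        }
    }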

Afaict, the two options we have here are:

  1. Institute this "hack" and keep the code footprint small, or
  2. remove the call to pool.Sync in fcu and change every test case that uses the simulated beacon to call pool.Sync manually after inserting txs (a sketch of this follows below).
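(Editor's note: for illustration, option 2 would give every simulated-beacon test a small helper along these lines; the helper name is hypothetical, and pool.Sync is the same call quoted above.)

    import (
        "testing"

        "github.com/ethereum/go-ethereum/core/txpool"
    )

    // waitForPool blocks on the pool's internal reset explicitly before a
    // test triggers block production, instead of relying on fcu to do so
    // in simulator mode.
    func waitForPool(t *testing.T, pool *txpool.TxPool) {
        t.Helper()
        if err := pool.Sync(); err != nil {
            t.Fatalf("failed to sync transaction pool: %v", err)
        }
    }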

fjl (Contributor) commented Jun 25, 2024

@jwasinger please look at this again

jwasinger (Contributor, Author) commented Jun 25, 2024

I spent the day looking into this, trying to come up with a solution that doesn't involve manually calling txpool Sync from fcu in zero-period dev mode, and realized that it seems unworkable without making large changes elsewhere in the codebase.

I also looked at intercepting the NewTxsEvent on a separate channel from the one read in api.loop, and feeding the notification back into the api.loop channel (in some magical non-blocking way involving a separate goroutine). I couldn't come up with a solution that I was sure was deadlock-free and that couldn't leave executable transactions/withdrawals dangling without ever being included.

Right now, the solution in this PR is the best I can think of.

I would like to improve it further by moving the txpool.Sync invocation out of fcu and into dev-mode-specific commit logic. I think this would improve the readability of fcu with no apparent downside; a rough sketch follows.
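(Editor's note: a rough sketch of the shape that refactor might take — illustrative only; commitWork and the exact signatures are assumptions, not actual geth code.)

    // commitWork is a hypothetical dev-mode commit path: it syncs the
    // pool itself before sealing, so forkChoiceUpdated no longer needs a
    // simulatorMode special case.
    func (c *SimulatedBeacon) commitWork(withdrawals []*types.Withdrawal) error {
        // Block on the pool's reset here, in dev-mode-only code...
        if err := c.eth.TxPool().Sync(); err != nil {
            return fmt.Errorf("failed to sync transaction pool: %w", err)
        }
        // ...then seal a block as before; fcu itself stays simulator-agnostic.
        return c.sealBlock(withdrawals, uint64(time.Now().Unix()))
    }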

Review comment on:

    newTxs = make(chan core.NewTxsEvent)
    sub = a.sim.eth.TxPool().SubscribeTransactions(newTxs, true)
    commitMu = sync.Mutex{}
Review comment (Contributor):

Just fyi, the canonical way to initialize a variable with its zero value is just to declare it, i.e. you can write

var (
     commitMu sync.Mutex
)

Initializing with a zero literal is bad style.

Review comment on:

    go func() {
        commitMu.Lock()
        defer commitMu.Unlock()
Review comment (Contributor):
Not sure if it matters, but this construction with the lock will not preserve the receive order of the select clauses. If both <-newTxs and <-a.sim.withdrawals.pending become ready around the same time, they will both spawn new goroutines in quick succession; those goroutines then race for the lock, and there is no guarantee that the goroutine spawned first acquires the lock first. The sketch below demonstrates this.
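(Editor's note: a minimal, self-contained demonstration of that non-determinism; illustrative only, and the print order varies from run to run.)

    package main

    import (
        "fmt"
        "sync"
    )

    func main() {
        var (
            mu sync.Mutex
            wg sync.WaitGroup
        )
        // Goroutines are spawned in a fixed order, but they acquire the
        // lock in whatever order the scheduler lets them run.
        for i := 0; i < 5; i++ {
            wg.Add(1)
            go func(id int) {
                defer wg.Done()
                mu.Lock()
                defer mu.Unlock()
                fmt.Println("commit from goroutine", id)
            }(i)
        }
        wg.Wait()
    }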

Reply (jwasinger, Author):

It should be fine: each goroutine eventually unblocks and commits, so every pending new-txs/withdrawals notification still triggers a commit.

jwasinger (Contributor, Author) commented Jun 28, 2024

There is a problem with the logic (one that is also present in master): we only call Commit once when responding to newTxs. This can leave executable txs dangling if they don't all fit in the block and no new txs subsequently become executable.

I looked into fixing this by altering the commit logic to commit in a loop until the miner returns an empty payload. However, this is nontrivial: by the time we are building a payload, we have already read the withdrawals from their channel, and we would somehow have to re-include them. A toy model of the loop idea follows below.
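(Editor's note: a self-contained toy of the loop idea, with the withdrawals wrinkle marked inline; all numbers and names are illustrative.)

    package main

    import "fmt"

    func main() {
        pending := 25         // executable txs waiting in the pool
        const blockSpace = 10 // txs that fit in one block

        withdrawals := []string{"w1", "w2"} // already read from their channel
        for {
            n := pending
            if n > blockSpace {
                n = blockSpace
            }
            if n == 0 && len(withdrawals) == 0 {
                break // the "empty payload" condition: nothing left dangling
            }
            fmt.Printf("commit block: %d txs, %d withdrawals\n", n, len(withdrawals))
            pending -= n
            // The hard part noted above: the withdrawals were consumed by
            // the first block, and there is no channel to re-read them from.
            withdrawals = nil
        }
    }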

Tbh, I'm not sure why the spamming test case included in this PR doesn't trigger this... I'll keep brainstorming today about how to fix it.

fjl (Contributor) commented Jun 28, 2024

Yeah, sounds like something that should be fixed.

jwasinger (Contributor, Author)

Closing in favor of #30264.

jwasinger closed this Aug 9, 2024
lightclient added a commit that referenced this pull request Aug 19, 2024
…30264)

closes #29475, replaces #29657, #30104 

This fixes two issues. First, a deadlock where the txpool attempts to reorg but can't complete because there are no readers left on the new-txs subscription. Second, a problem in on-demand mode where txs could be left pending when there were more pending txs than block space.

Co-authored-by: Martin Holst Swende <martin@swende.se>
Linked issue: Transactions stuck in pending in dev-mode (#29475)