eth: reset miner, engine, and txpool when calling SetHead to prevent crashes #16713

Closed
wants to merge 1 commit

Conversation

@albrow commented May 8, 2018

This small change fixes a bug in tx_pool.go which causes geth to crash in certain cases when debug.setHead is used.

I discovered this bug while running a single-node, private PoA network for testing purposes. Here are the steps I took to consistently produce the crash:

  1. Start a single-node, private PoA network (I used puppeth and followed this guide as a starting point). Be sure to enable the RPC API with the flag: --rpcapi 'miner,debug'.
  2. Mine at least one block. If you need to start the miner, you can use the "miner_start" endpoint of the JSON RPC API.
  3. Stop the miner via the "miner_stop" endpoint.
  4. Set the head to a previous block number via the "debug_setHead" endpoint.
  5. Start the miner again.
  6. Geth crashes with a stack trace that looks like the following:
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x447559f]

goroutine 75 [running]:
github.com/ethereum/go-ethereum/core/types.(*Block).NumberU64(...)
	/Users/alex/programming/go/src/github.com/ethereum/go-ethereum/core/types/block.go:310
github.com/ethereum/go-ethereum/core.(*TxPool).reset(0xc420241dc0, 0xc420578b40, 0xc420578d80)
	/Users/alex/programming/go/src/github.com/ethereum/go-ethereum/core/tx_pool.go:370 +0x6bf
github.com/ethereum/go-ethereum/core.(*TxPool).loop(0xc420241dc0)
	/Users/alex/programming/go/src/github.com/ethereum/go-ethereum/core/tx_pool.go:289 +0x504
created by github.com/ethereum/go-ethereum/core.NewTxPool
	/Users/alex/programming/go/src/github.com/ethereum/go-ethereum/core/tx_pool.go:253 +0x4ed
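
For reference, steps 2 through 4 can also be driven programmatically over the node's RPC interface. Below is a rough sketch using go-ethereum's rpc client; the endpoint URL and the target block number are only illustrative, and the node is assumed to have been started with the miner and debug APIs enabled as in step 1.

package main

import (
	"log"

	"github.com/ethereum/go-ethereum/rpc"
)

func main() {
	// Illustrative endpoint; adjust to wherever the node's HTTP RPC is listening.
	client, err := rpc.Dial("http://localhost:8545")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Step 2: start the miner (one thread) and let it seal at least one block.
	if err := client.Call(nil, "miner_start", 1); err != nil {
		log.Fatal(err)
	}
	// ... wait until at least one block has been mined ...

	// Step 3: stop the miner.
	if err := client.Call(nil, "miner_stop"); err != nil {
		log.Fatal(err)
	}
	// Step 4: rewind the chain to an earlier block (hex-encoded block number).
	if err := client.Call(nil, "debug_setHead", "0x1"); err != nil {
		log.Fatal(err)
	}
	// Step 5: starting the miner again is what triggered the panic above.
	if err := client.Call(nil, "miner_start", 1); err != nil {
		log.Fatal(err)
	}
}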

I believe this is happening because, when the ChainHeadEvent is received in tx_pool.go, the variable head refers to a block that no longer exists. We can detect this case by checking whether the block number of head is greater than the current block number, and if it is, we can fix it by simply setting head to the current block.
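
One way to express that check, as a small sketch: the helper below is hypothetical (the actual commit inlines the condition, as the outdated diff further down shows), and it assumes the surroundings of core/tx_pool.go, where the blockChain interface and the types import already exist.

// normalizeHead is a hypothetical helper capturing the guard described above:
// after debug_setHead, the remembered head block can be newer than the chain's
// actual current block, which later trips a nil block lookup inside reset()
// and panics.
func normalizeHead(chain blockChain, head *types.Block) *types.Block {
	if head != nil && chain.CurrentBlock().NumberU64() < head.NumberU64() {
		// The remembered head was rewound away by debug_setHead; fall back to
		// the chain's actual current block before resetting the pool.
		return chain.CurrentBlock()
	}
	return head
}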

After making this change, I tested it manually using the steps described above and observed that geth no longer crashes and continues to mine starting from the new block number. I also ran make test to ensure that all the existing tests still pass (they do). However, there still could be some unintended side effects I'm not aware of.

If we would like to add a test case for this, I would be happy to take a stab at it and include it in this PR. I just need some advice on where the test should live and how to get started on writing it.

core/tx_pool.go Outdated
@@ -286,6 +286,12 @@ func (pool *TxPool) loop() {
 				if pool.chainconfig.IsHomestead(ev.Block.Number()) {
 					pool.homestead = true
 				}
+				if pool.chain.CurrentBlock().NumberU64() < head.NumberU64() {
A project member commented on this diff:

Wouldn't it make more sense to use ev.Block instead of pool.chain.CurrentBlock()? It's faster and should be the same thing.

@albrow (author) replied:

Sure thing, I just didn't realize they were the same. Updated.

@GitCop commented May 10, 2018

Thank you for your contribution! Your commits seem to not adhere to the repository coding standards

  • Commit: 63e6250dde40a46da3506c9150228339400b22dc
  • Commits must be prefixed with the package(s) they modify

Please check the contribution guidelines for more details.


This message was auto-generated by https://gitcop.com

@albrow (author) commented May 14, 2018

Any additional feedback here?

It looks like the CI tests are failing due to a timeout unrelated to my changes.

On May 15, 2018, @albrow changed the title from "core: fix a crash in tx_pool.go when debug.setHead is used" to "eth: reset miner, engine, and txpool when calling SetHead to prevent crashes".
@albrow (author) commented May 15, 2018

After running more of our smart contract tests against my fork, I realized Geth was still crashing in some cases after calling debug.setHead. Even stranger, I observed some cases where both ev.Block and pool.chain.CurrentBlock() had block numbers that were too high and did not actually exist in the chain, which resulted in nil pointer dereferences. In addition, after calling debug.setHead the miner often either logged an error and stopped or simply hung waiting for new transactions (depending on how the network is configured and whether it's using Ethash or Clique for consensus).

Rather than trying to track down every place in the code where we were holding onto some pointers to blocks, I decided to opt for a different approach. This new PR modifies EthAPIBackend.SetHead to stop and re-initialize the miner, txpool, and consensus engine. This guarantees that we don't run into any problems with those parts of the codebase holding onto old state, but at the cost of making SetHead slower and computationally more expensive.
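
A rough sketch (not the exact diff) of the shape of that change; the field and constructor names follow what appears later in this thread, and the miner/engine re-initialization is only summarized in a comment because its exact form depends on the consensus engine in use:

func (b *EthAPIBackend) SetHead(number uint64) {
	// Stop the components that hold pointers into the old chain state.
	b.eth.miner.Stop()
	b.eth.txPool.Stop()

	// Rewind the chain itself.
	b.eth.blockchain.SetHead(number)

	// Rebuild the txpool against the rewound chain so it can no longer
	// reference blocks that were rewound away.
	b.eth.txPool = core.NewTxPool(b.eth.config.TxPool, b.eth.chainConfig, b.eth.blockchain)

	// Re-create the consensus engine (Clique in our setup) and the miner here,
	// then restart mining if it was running before SetHead was called.
}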

I've updated my commit message and PR title to reflect the fact that the changes now lie in the "eth" package instead of the "core" package.

I'm not sure if this covers everything that could be affected by SetHead, or if there is a more efficient way to do this. I do know that with these changes, Geth is no longer crashing or stalling out for our use case. I look forward to hearing your feedback, and I'm happy to make further changes as needed.

@holiman (contributor) commented May 25, 2018

On the one hand, this looks clean and the setHead operation is in the area "Use at your own risk, may break things", so I'm inclined to accept this PR if it works for you.

However, I'm a bit curious what happens with all existing channels and go-routines coupled to the old txpool and miner that you simply discard. I'm thinking that maybe that should be cleaned up, otherwise things could potentially get very weird.

Could you try to, e.g., call setHead 100 times in a row and check what happens?

@albrow (author) commented May 29, 2018

@holiman Our tests call setHead about 200 times and I am able to run them multiple times in a row without any apparent issues. I also have been monitoring memory usage and it does not appear to increase (though it's possible the increase is too small to detect with htop due to rounding).

I assumed calling b.eth.miner.Stop() and b.eth.txPool.Stop() would clean up any resources like goroutines and channels. Is that not the case?

	b.eth.blockchain.SetHead(number)
	b.eth.txPool = core.NewTxPool(b.eth.config.TxPool, b.eth.chainConfig, b.eth.blockchain)
	if b.eth.chainConfig.Clique != nil {
@albrow (author) commented on this diff, May 29, 2018:

At my company, we use a single node network for testing with Clique as the consensus engine. I'm not sure how to handle Ethash or other consensus engines here. Is that something we need to consider?

@albrow (author) commented Jan 22, 2019

@holiman or @karalabe I believe I've answered your questions and responded to your comments. Do you have any additional feedback here?

For the past 6-7 months at 0x, we've been running tests against our fork of Geth which includes these changes. We haven't experienced any major issues. Our test suite is extensive and hits the debug_setHead endpoint ~1,000 times for each test run.

We have reached the point where we have some other developers in the 0x ecosystem using our fork of Geth and relying on the debug_setHead feature (as well as the debug_increaseTime feature introduced in #16834). If at all possible, we would like to merge these changes upstream to make it easier for everyone.

Edit: To clarify, the original issue that I wanted to fix with this PR appears to still exist (at least in some form; I'm not sure if there are still cases where Geth crashes or if it only hangs now). If we try running our test suite against the latest release (as of this comment, v1.8.21), Geth hangs/does not mine any new blocks and the tests are unable to complete.

@holiman (contributor) commented Jan 24, 2019

Comment from review discussion:

setHead should ideally be identical to a deep reorg, so it would be better if the sethead functionality used the same notification mechanisms / channels as are triggered during a reorg. If some component, like txpool, doesn't handle such a reorg properly, then that needs to be fixed. Makes sense?
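
For context, a minimal sketch of the notification mechanism being referred to: components subscribe to the chain's head events and must cope with whatever head arrives, including one that is older than the previous head, which is exactly what a deep reorg (or a reorg-style setHead) produces. watchHead and its handler are hypothetical; the subscription API is the existing one on core.BlockChain.

import (
	"github.com/ethereum/go-ethereum/core"
	"github.com/ethereum/go-ethereum/core/types"
)

// watchHead is a hypothetical consumer of chain head notifications. A deep
// reorg (or a setHead wired through the same path) simply shows up here as a
// new head event, possibly pointing at an older block than the previous one.
func watchHead(chain *core.BlockChain, handle func(*types.Block)) {
	heads := make(chan core.ChainHeadEvent, 16)
	sub := chain.SubscribeChainHeadEvent(heads)
	defer sub.Unsubscribe()

	for {
		select {
		case ev := <-heads:
			// The consumer must tolerate ev.Block having a lower number than
			// the head it saw previously, rather than assuming it only grows.
			handle(ev.Block)
		case <-sub.Err():
			return
		}
	}
}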

@albrow (author) commented Jan 29, 2019

@holiman Thank you for the response. I agree with what you're saying in theory. Before I made any changes, setHead was pretty much implemented as a chain re-org in the way you described. However, as I have shown, this leads to crashes/hangs in other parts of the code that don't expect sudden changes to the chain.

I don't necessarily have the knowledge of Geth internals required to debug and fix issues with the chain re-org approach. This is especially true when there are complicated interactions between different parts of the codebase (e.g., as I described above, ev.Block and pool.chain.CurrentBlock() both having block numbers that are too high, which results in nil pointer dereferences when you try to look up the corresponding block in the chain).

Consider this: This PR addresses a specific, reproducible bug with setHead. The current version of setHead as implemented in the master branch does not work. The changes in this PR are isolated to a small part of the code (one single file) and will be easy to remove in the future in case someone comes up with a better fix. The fix in this PR does not result in any resource leaks as far as I can tell, and the only real downside is a reduction in performance (I can't speak for everyone, but for our test suite the performance impact is negligible). Would you consider merging this as a stopgap measure?

Otherwise, if you really don't want to merge this PR as is, I will try my best to address your concerns and implement setHead using the existing chain re-org mechanisms instead of restarting everything. This would take some additional time and would mean addressing the issues in tx_pool.go and any other place where there are potentially outdated pointers to blocks. Could you point me to some examples where one part of the codebase gets notified of a chain re-org and handles it appropriately? Is there any additional info you could share to help me better understand what is going wrong?

@albrow (author) commented Feb 26, 2019

To clarify on the current status of this PR: It is currently not working due to merge conflicts/upstream changes. It will take some significant work for me to rebase and fix any issues, so I'm waiting for answers to my questions and some consensus on the approach before continuing. I'm happy to continue working on this PR. Just need some feedback first.

@karalabe (member) commented Mar 7, 2019

Potentially fixed by #19216 in a cleaner way. That one reproduces the issue more deterministically, so we might be able to fix this in the txpool, fully isolated.

@adamschmideg (contributor) commented:

We're trying to reproduce it and create an issue based on that.

@holiman (contributor) commented Mar 26, 2019

Superseded by #19308

@albrow (author) commented Mar 28, 2019

@holiman @karalabe While I am happy that you were able to find a cleaner solution, it unfortunately still doesn't fix all the problems with debug_setHead. See #19346.
