-
Notifications
You must be signed in to change notification settings - Fork 70
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
capi: Fix batch size control #2308
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was done like you suggest before, but was changed in order to fix OOM crashes on low RAM VMs. The current_batch_state_size check doesn't guarantee that the next iteration execution overflows memory. When it fails to execute due to a potential overflow, it produces an error. The condition for that happens multiple layers inside, so I've decided to use an exception and not result/error codes.
I think it is likely my fault that the logic is broken here. I agree that on MemoryLimitError it should reinsert the block back to the block_buffer, because the block was not executed, and needs to be executed again (for block_buffer.pop_back to return it again).
For silkworm_execute_blocks_ephemeral this is not a problem, because prefetched_blocks.pop_front is done only for successful executions.
Alternatively consider to align silkworm_execute_blocks_perpetual logic to do the same and pop only on success.
I am split on this, so let's pick @canepat brain. To summarize: The current logic is:
I see two problems with this: we use exceptions to control program flow and some blocks must be executed twice. Even if we decide to pop the block only on success, we still face these two issues. The original approach was:
The problem is that we find out if the block exceeds the limit only after the execution. In some cases, this can cause out-of-memory exceptions. There is no easy way to detect if a block exceeds the limit before the execution. Some potential fixes:
|
Yes, the main reason of not doing it the old way (apriori estimation) is that the estimation function must know how many things are going to be updated/inserted, and this only becomes known AFTER it executes. You are saying that this won't be a problem for erigon 3. In this case in my opinion we should not spend too much time on this, and do minimal changes to fix the logic. So I'm trying to understand your points better and evaluate in terms of work scope:
Regarding the exception usage, we have considered using an extended enum for embedding this error into the execution result, but at the end decided not to do it because it lead to extra complexity. Nevertheless, it wouldn't change the need to reinsert the block. If it bothers you to call |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🙏
silkworm_capi_test hangs on CI though |
@battlmonstr issue fixed, please review again |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
The problem with the current approach is that it catches
db::Buffer::MemoryLimitError
too late, when the block has already been popped from theblock_buffer
. Then in the next iteration we would get the next block and the assertSILKWORM_ASSERT(block->header.number == block_number)
would fail. For this to work we need to re-insert the failed block in the exception handler.I think the better approach is to control the batch size with the
state_buffer.current_batch_state_size() <= max_batch_size
check. Additionally, using exceptions for control flow is generally discouraged.I changed the unit test to cover for this scenario. Also I have tested the change by synchronizing the full Sepolia chain data.