Validator can exceed expected number of mmaps #19320
Comments
cc @ryoqun |
cc @jeffwashington just in case (I think the shrinker is the problem. I'm cc-ing you because you're more up to date on accountsdb in general recently) |
Is this, or could this be, related to the issue with shrink holding mem maps open that was fixed very late in 1.7? |
@ryoqun please put your thinking about why it is shrink here so I can leverage your experience. ;-) |
@jeffwashington this happens on 1.6 as well which doesn’t have those changes. |
The rate seems to be much slower on 1.6 than on 1.7. On 1.6 it took a few weeks to become a problem; on 1.7 it is pretty obvious within a couple of days |
a metric that shows the problem. |
@jeffwashington thanks for taking a look. Here's a brain dump:
bad (from mainnet-beta): the number of appendvecs continues to increase (only a node restart resets the appendvec count to the bare minimum)
good (from testnet): the number of appendvecs is clearly capped: |
things to try:
|
For this instance of the leak bug, fortunately, a restart drastically reduces the number of appendvecs. So there should be code that is doing the correct thing, and differential analysis might be the shorter path. Also, be careful not to introduce a more dangerous bank (account) hash mismatch error. |
also, this is the original pr, which I wrote: |
Here are 2 pubkeys that show up multiple times in a snapshot I got:
|
@lijunwangs and I are continuing to gather data and consider this issue. |
I am doing dv3qDFk1DTF36Z62bNvrCXe9sKATA6xvVy6A798xxAS - v1.7.10 --accounts-shrink-optimize-total-space false |
I think this test might reproduce the leak:
AccountsDB only knows about ~427,000 + 2000 stores:
But the process's map count keeps growing: after 4 million slots, it's up to 500k+ stores.
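
For anyone who wants to compare the two numbers from outside the validator, here is a minimal sketch (not code from the validator) that counts the mappings listed in a Linux `/proc/<pid>/maps` file. The function names and the `"accounts"` path marker are hypothetical; the marker assumes the AppendVec files live under a directory whose path contains that string, which may not match a given ledger layout. It reads `/proc/self/maps` only so that it runs on its own; for a real check you would point it at the validator's pid.

```rust
use std::fs;

/// Count all mappings listed in the text of a /proc/<pid>/maps file.
/// Each non-empty line describes one mapping.
fn count_mappings(maps: &str) -> usize {
    maps.lines().filter(|line| !line.trim().is_empty()).count()
}

/// Count mappings whose backing path contains `marker`.
/// The path is the sixth whitespace-separated field, when present;
/// anonymous mappings have no path and are skipped.
fn count_mappings_matching(maps: &str, marker: &str) -> usize {
    maps.lines()
        .filter_map(|line| line.split_whitespace().nth(5))
        .filter(|path| path.contains(marker))
        .count()
}

fn main() {
    // Hypothetical marker: adjust it to the real accounts directory.
    let marker = "accounts";
    match fs::read_to_string("/proc/self/maps") {
        Ok(maps) => println!(
            "total mappings: {}, matching {:?}: {}",
            count_mappings(&maps),
            marker,
            count_mappings_matching(&maps, marker)
        ),
        Err(err) => eprintln!("could not read /proc/self/maps (Linux only): {}", err),
    }
}
```

Sampling this periodically and comparing it with the store count AccountsDB reports would show whether the gap keeps widening. Paths containing spaces are only matched on their first word, which is good enough for a rough count.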
|
Actually, I didn't have caching enabled, so it would sometimes create multiple stores in a slot. With caching enabled and no shrink, the number of stores is now stable. |
I have created a draft PR for the issues discussed: |
I did see num_snapshot_storage on my GCE validator with --accounts-shrink-optimize-total-space set to true increase at a slightly faster pace than on the v1.7.10 validator with --accounts-shrink-optimize-total-space false. During the window I was testing with --accounts-shrink-optimize-total-space false, num_snapshot_storage also just kept increasing. |
Problem
Seen on 1.6 and 1.7: the validator goes through a period where the number of mmaps exceeds the expected value, which is around 430k.
It may take on the order of weeks of running to reproduce.
Proposed Solution
Debug the leak and fix it.