-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add leak detection to Store #121482
base: main
Are you sure you want to change the base?
Add leak detection to Store #121482
Conversation
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination) |
ShardLock shardLock, | ||
OnClose onClose, | ||
boolean hasIndexSort, | ||
boolean detectLeaks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like the need to be able to turn leak detection off, but there were a few places where we don't get access to the store to manage its lifecycle correctly.
Store.OnClose.EMPTY, | ||
false, | ||
false | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is one such case, alternatively we could add it to a collection to be closed upon @After
.
@@ -566,8 +566,7 @@ private void testDirectories( | |||
} | |||
|
|||
final Store store = new Store(shardId, indexSettings, directory, new DummyShardLock(shardId)); | |||
store.incRef(); | |||
releasables.add(store::decRef); | |||
releasables.add(store::close); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure why the lifecycle was being managed like above, and whether I've fundamentally changed something here. It seems like we should have a single reference from creation, so we don't need to incRef
?
When working on ES-10641, I noticed one category of slow shut down is due to the only possible relocation failing repeatedly, which is in turn caused by the the shard lock being held on the target node.
The message on the shard lock is
closing shard
which indicates the lock was once held by aStore
which never released the shard lock.We recently made a change to dump hot threads when shard creation was prevented by a lingering shard lock, and in instances of the above error occurring you can now see there is no active thread doing anything related to closing the shard, so I suspect there may be a store reference leaking somewhere, preventing the final release of the shard lock.
This PR adds leak tracking to the
Store
's refCounter in the hope that we might observe the leak in CI.