Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add leak detection to Store #121482

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

nicktindall
Copy link
Contributor

@nicktindall nicktindall commented Feb 1, 2025

When working on ES-10641, I noticed one category of slow shut down is due to the only possible relocation failing repeatedly, which is in turn caused by the the shard lock being held on the target node.
The message on the shard lock is closing shard which indicates the lock was once held by a Store which never released the shard lock.

We recently made a change to dump hot threads when shard creation was prevented by a lingering shard lock, and in instances of the above error occurring you can now see there is no active thread doing anything related to closing the shard, so I suspect there may be a store reference leaking somewhere, preventing the final release of the shard lock.

This PR adds leak tracking to the Store's refCounter in the hope that we might observe the leak in CI.

@nicktindall nicktindall added >test Issues or PRs that are addressing/adding tests :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) labels Feb 3, 2025
@nicktindall nicktindall marked this pull request as ready for review February 3, 2025 23:45
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination Meta label for Distributed Coordination team label Feb 3, 2025
ShardLock shardLock,
OnClose onClose,
boolean hasIndexSort,
boolean detectLeaks
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like the need to be able to turn leak detection off, but there were a few places where we don't get access to the store to manage its lifecycle correctly.

Store.OnClose.EMPTY,
false,
false
);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is one such case, alternatively we could add it to a collection to be closed upon @After.

@@ -566,8 +566,7 @@ private void testDirectories(
}

final Store store = new Store(shardId, indexSettings, directory, new DummyShardLock(shardId));
store.incRef();
releasables.add(store::decRef);
releasables.add(store::close);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why the lifecycle was being managed like above, and whether I've fundamentally changed something here. It seems like we should have a single reference from creation, so we don't need to incRef?

@nicktindall nicktindall requested a review from ywangd February 3, 2025 23:52
@nicktindall nicktindall changed the title WIP Add leak detection to Store Add leak detection to Store Feb 3, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) Team:Distributed Coordination Meta label for Distributed Coordination team >test Issues or PRs that are addressing/adding tests v9.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants