
Avoid taking unique lock when starting lookup request #5059

Closed
spolitov opened this issue Jul 13, 2020 · 0 comments
Comments

@spolitov
Contributor

No description provided.

@spolitov spolitov added the area/docdb YugabyteDB core features label Jul 13, 2020
@spolitov spolitov self-assigned this Jul 13, 2020
spolitov added a commit that referenced this issue Jul 15, 2020
Summary:
Currently, when the meta cache has no entry for a partition, a lookup request acquires two locks.
It first takes the shared lock to check the cache, then takes the unique lock to add its callback to the waiting list.
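
For reference, a minimal sketch of this two-lock pattern, with illustrative names (MetaCacheSketch, FastLookupUnlocked) rather than the actual YugabyteDB MetaCache code:

```cpp
#include <functional>
#include <shared_mutex>
#include <string>
#include <vector>

using LookupCallback = std::function<void()>;

class MetaCacheSketch {
 public:
  void LookupTabletByKey(const std::string& partition_key, LookupCallback callback) {
    {
      std::shared_lock<std::shared_mutex> lock(mutex_);
      if (FastLookupUnlocked(partition_key)) {
        callback();  // Cache hit: served entirely under the shared lock.
        return;
      }
    }
    // Cache miss: the same mutex is re-acquired in exclusive mode just to
    // append a waiter (a real implementation would re-check the cache here).
    // Under load, threads pile up on this unique lock.
    std::unique_lock<std::shared_mutex> lock(mutex_);
    waiters_.push_back(std::move(callback));
    // ... trigger an RPC to the master if one is not already in flight ...
  }

 private:
  bool FastLookupUnlocked(const std::string& /*partition_key*/) {
    return false;  // Placeholder: the real code consults the partition map.
  }

  std::shared_mutex mutex_;
  std::vector<LookupCallback> waiters_;
};
```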

With a large volume of read/write requests and master communication issues, this can leave hundreds of threads waiting on the meta cache mutex.
The resulting state closely resembles a deadlock, since processing the lookup response also requires taking the same lock.

This diff changes the waiting list to a lock-free multi-producer single-consumer (MPSC) list.
As a result, a callback can be added while holding only the shared lock, and the response can be processed while holding the unique lock.
This rules out the scenario where many requests pile up while the master cannot serve lookups yet:
the cluster no longer gets into a jammed state, and processing starts as soon as the master becomes responsive.
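
A hedged sketch of such a waiting list, assuming a Treiber-style lock-free MPSC list (the names Waiter and MpscWaiterList are illustrative, not taken from the diff):

```cpp
#include <atomic>
#include <functional>

struct Waiter {
  std::function<void()> callback;
  Waiter* next = nullptr;
};

class MpscWaiterList {
 public:
  // Multi-producer: safe to call concurrently from any number of threads,
  // so a lookup only needs the shared lock to register its callback.
  void Push(Waiter* waiter) {
    Waiter* old_head = head_.load(std::memory_order_relaxed);
    do {
      waiter->next = old_head;
    } while (!head_.compare_exchange_weak(
        old_head, waiter,
        std::memory_order_release, std::memory_order_relaxed));
  }

  // Single consumer: detaches the whole list with one atomic exchange,
  // called while holding the unique lock when the lookup response arrives.
  // Entries come back in reverse push order.
  Waiter* PopAll() {
    return head_.exchange(nullptr, std::memory_order_acquire);
  }

 private:
  std::atomic<Waiter*> head_{nullptr};
};
```

With this shape, producers never contend for exclusive access just to register a waiter, so a slow master can no longer serialize every lookup behind the unique lock.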

Test Plan:
1) Launch a heavy workload with the async client, so the number of concurrent requests can be very high.
2) After the block cache fills up, start an additional async workload that requires a lot of new lookups.
3) After 1 minute, restart the master leader.
4) After 1 minute, restart one tserver.
5) After this tserver completes bootstrapping, restart the tserver with the highest number of leaders.

Check that the cluster recovers afterwards.

Reviewers: rsami, timur

Reviewed By: rsami, timur

Subscribers: bogdan, rsami, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D8892
spolitov added a commit that referenced this issue Jul 21, 2020
…okup request

Summary:
Currently, when the meta cache has no entry for a partition, a lookup request acquires two locks.
It first takes the shared lock to check the cache, then takes the unique lock to add its callback to the waiting list.

With a large volume of read/write requests and master communication issues, this can leave hundreds of threads waiting on the meta cache mutex.
The resulting state closely resembles a deadlock, since processing the lookup response also requires taking the same lock.

This diff changes the waiting list to a lock-free multi-producer single-consumer (MPSC) list.
As a result, a callback can be added while holding only the shared lock, and the response can be processed while holding the unique lock.
This rules out the scenario where many requests pile up while the master cannot serve lookups yet:
the cluster no longer gets into a jammed state, and processing starts as soon as the master becomes responsive.

Test Plan:
1) Launch a heavy workload with the async client, so the number of concurrent requests can be very high.
2) After the block cache fills up, start an additional async workload that requires a lot of new lookups.
3) After 1 minute, restart the master leader.
4) After 1 minute, restart one tserver.
5) After this tserver completes bootstrapping, restart the tserver with the highest number of leaders.

Check that the cluster recovers afterwards.

Jenkins: patch: 2.2.0

Reviewers: rsami, timur

Reviewed By: timur

Subscribers: ybase, rsami, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D8951
spolitov added a commit that referenced this issue Jul 22, 2020
…okup request

Summary:
Currently, when the meta cache has no entry for a partition, a lookup request acquires two locks.
It first takes the shared lock to check the cache, then takes the unique lock to add its callback to the waiting list.

With a large volume of read/write requests and master communication issues, this can leave hundreds of threads waiting on the meta cache mutex.
The resulting state closely resembles a deadlock, since processing the lookup response also requires taking the same lock.

This diff changes the waiting list to a lock-free multi-producer single-consumer (MPSC) list.
As a result, a callback can be added while holding only the shared lock, and the response can be processed while holding the unique lock.
This rules out the scenario where many requests pile up while the master cannot serve lookups yet:
the cluster no longer gets into a jammed state, and processing starts as soon as the master becomes responsive.

Test Plan:
1) Launch a heavy workload with the async client, so the number of concurrent requests can be very high.
2) After the block cache fills up, start an additional async workload that requires a lot of new lookups.
3) After 1 minute, restart the master leader.
4) After 1 minute, restart one tserver.
5) After this tserver completes bootstrapping, restart the tserver with the highest number of leaders.

Check that the cluster recovers afterwards.

Jenkins: patch: 2.1.8

Reviewers: rsami, timur

Reviewed By: timur

Subscribers: ybase, rsami, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D8952