Avoid taking unique lock when starting lookup request #5059
Labels: area/docdb (YugabyteDB core features)
spolitov added a commit that referenced this issue on Jul 15, 2020:
Summary: Currently, when the meta cache has no entry for a partition, a lookup request acquires two locks: it first takes the shared lock to check the cache, then takes the unique lock to add its callback to the waiting list. If there are many read/write requests and the master has communication issues, this can leave hundreds of threads waiting on the meta cache mutex. The resulting state is very similar to a deadlock, since processing the response also requires taking the same lock.

This diff changes the waiting list to a lock-free multi-producer single-consumer list, so a callback can be added while holding the shared lock and the response can be processed while holding the unique lock. This rules out the jammed state that occurs when many requests pile up while the master cannot yet serve lookups, and processing starts as soon as the master becomes responsive.

Test Plan:
1) Launch a heavy workload with the async client, so the number of concurrent requests can be very high.
2) After the block cache is filled, start an additional async workload that requires many new lookups.
3) After 1 minute, restart the master leader.
4) After 1 minute, restart one tserver.
5) After this tserver completes bootstrapping, restart the tserver with the maximum number of leaders. Check that the cluster recovers.

Reviewers: rsami, timur
Reviewed By: rsami, timur
Subscribers: bogdan, rsami, ybase
Differential Revision: https://phabricator.dev.yugabyte.com/D8892
spolitov added a commit that referenced this issue on Jul 21, 2020:
…okup request (backport; summary and test plan identical to the Jul 15 commit above)
Jenkins: patch: 2.2.0
Reviewers: rsami, timur
Reviewed By: timur
Subscribers: ybase, rsami, bogdan
Differential Revision: https://phabricator.dev.yugabyte.com/D8951
spolitov added a commit that referenced this issue on Jul 22, 2020:
…okup request (backport; summary and test plan identical to the Jul 15 commit above)
Jenkins: patch: 2.1.8
Reviewers: rsami, timur
Reviewed By: timur
Subscribers: ybase, rsami, bogdan
Differential Revision: https://phabricator.dev.yugabyte.com/D8952