Avoid taking unique lock when starting lookup request #5059
Labels: area/docdb (YugabyteDB core features)
spolitov added a commit that referenced this issue on Jul 15, 2020:
Summary: Currently, when the meta cache has no entry for a partition, a lookup request acquires two locks: it first takes the shared lock to check the cache, then takes the unique lock to add its callback to the waiting list. If there are many read/write requests and the master has communication issues, this can leave hundreds of threads waiting on the meta cache mutex. The resulting state is very similar to a deadlock, since processing the response also requires taking the same lock.

This diff changes the waiting list to a lock-free multi-producer single-consumer list, so a callback can be added while holding the shared lock and the response can be processed while holding the unique lock. This rules out the jammed state that occurs when many requests pile up while the master cannot yet serve lookups, and processing starts as soon as the master becomes responsive.

Test Plan:
1) Launch a heavy workload with the async client, so the number of concurrent requests can be very high.
2) After the block cache is filled, start an additional async workload that requires many new lookups.
3) After 1 minute, restart the master leader.
4) After 1 minute, restart one tserver.
5) After this tserver completes bootstrapping, restart the tserver with the maximum number of leaders. Check that the cluster recovers.

Reviewers: rsami, timur
Reviewed By: rsami, timur
Subscribers: bogdan, rsami, ybase
Differential Revision: https://phabricator.dev.yugabyte.com/D8892
spolitov added a commit that referenced this issue on Jul 21, 2020:
…okup request (backport; summary and test plan identical to the Jul 15 commit above)
Jenkins: patch: 2.2.0
Reviewers: rsami, timur
Reviewed By: timur
Subscribers: ybase, rsami, bogdan
Differential Revision: https://phabricator.dev.yugabyte.com/D8951
spolitov added a commit that referenced this issue on Jul 22, 2020:
…okup request (backport; summary and test plan identical to the Jul 15 commit above)
Jenkins: patch: 2.1.8
Reviewers: rsami, timur
Reviewed By: timur
Subscribers: ybase, rsami, bogdan
Differential Revision: https://phabricator.dev.yugabyte.com/D8952