[YSQL] Read latencies spike significantly (over 10ms) when there is a large concurrent write workload in the system. #11805
Comments
@spolitov wrote:
In multi-region setups, the round trip between nodes needed to replicate a write request at hybrid time ht1 can be high, for example about 10ms between the AWS east-1 and east-2 regions, and this can add roughly 10ms of extra read latency. For single-shard SELECTs we could optimize this by letting the TServer pick the read time rather than the YSQL layer.
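To make the arithmetic in the comment above concrete, here is a minimal sketch (illustrative Python, not YugabyteDB code) of the wait a tserver incurs when the requested read time is ahead of the tablet's safe time. The ~10ms round-trip figure is the assumption quoted in the comment, and the function name is hypothetical.

```python
def tserver_read_wait_ms(requested_read_time_ms, tablet_safe_time_ms):
    """Time the tserver must block before serving a read at the requested
    read time: it cannot read until safe time catches up (illustrative only)."""
    return max(0.0, requested_read_time_ms - tablet_safe_time_ms)

# While a write is being replicated across regions, the tablet's safe time can
# trail "now" by roughly one inter-region round trip (~10 ms in the example).
now_ms = 0.0
safe_time_ms = now_ms - 10.0

# YSQL picks read time = now on the client side: the read waits ~10 ms.
print(tserver_read_wait_ms(now_ms, safe_time_ms))        # 10.0

# The tserver picks the read time as its current safe time: no wait at all.
print(tserver_read_wait_ms(safe_time_ms, safe_time_ms))  # 0.0
```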
Assuming the above, this would yield a huge win in multi-region scenarios, so it is probably worth considering.
Reran the CassandraKeyValue and SqlInserts workloads against an RF-3 single-zone cluster.
java -jar yb-sample-apps.jar --workload CassandraKeyValue --nodes 172.150.31.201:9042 --num_threads_write 8
java -jar yb-sample-apps.jar --workload CassandraKeyValue --nodes 172.150.31.201:9042 --num_threads_write 32
java -jar yb-sample-apps.jar --workload CassandraKeyValue --nodes 172.150.31.201:9042 --num_threads_write 64
java -jar yb-sample-apps.jar --workload SqlInserts --nodes 172.150.31.201:5433 --num_threads_write 8
java -jar yb-sample-apps.jar --workload SqlInserts --nodes 172.150.31.201:5433 --num_threads_write 32
java -jar yb-sample-apps.jar --workload SqlInserts --nodes 172.150.31.201:5433 --num_threads_write 64
@amitanandaiyer, can you take a stab at it?
#11886 should fix the single-tablet latency issues.
…ible

Summary: To achieve consistency of reads from multiple tablets and/or across multiple operations in the context of a single transaction, YSQL selects the read time on the internal client side (in the Postgres process). This approach has a drawback when the client's hybrid clock shows a time in the future compared to the tserver's time, caused by clock skew. On receiving a read request with such a read time, the tserver will wait until the tablet's safe time has reached this future time, resulting in increased read latency.

To prevent the tserver from waiting while processing the read request, the read time should be omitted from the request. In this case the tserver will use the tablet's current safe time as the read time, and will return that time to the client (YSQL). The same read time should then be used by all other operations initiated by YSQL as part of the same transaction.

**Note:**
1. When the first read operation performs reads from different tablets, we detect this case and pick the read time on the client side. (If we allowed each tablet server to pick its own read time, the reads from different tablet servers would be inconsistent with each other.)
2. The client should not initiate parallel read operations at the beginning of a transaction. Even if each operation reads just one tablet, all these operations should use the same read time for consistency. After the read time has been picked, e.g. for the second operation of the transaction and beyond, parallel reads are fine. One case in which we send parallel operations in YSQL is foreign key checking.
3. In case of unexpected behavior, the new functionality can be disabled using the newly created gflag `force_preset_read_time_on_client`. Its default value is `false`, which enables the new behavior; set it to `true` to revert to the old behavior.
4. The fix in mainline is slightly different because of changes introduced by D13244 / c5f5125 and will be handled by https://phabricator.dev.yugabyte.com/D16201.

Test Plan:
Jenkins: rebase: 2.12
A new unit test is introduced:
```
./yb_build.sh --gtest_filter PgLibPqTest.NoReadRestartOnSingleTablet
```
Reviewers: sergei, mbautin, amitanand
Reviewed By: sergei, mbautin, amitanand
Subscribers: mbautin, yql
Differential Revision: https://phabricator.dev.yugabyte.com/D16345

…sible

Summary: To achieve consistency of reads from multiple tablets and/or across multiple operations in the context of a single transaction, YSQL selects the read time on the internal client side (in the Postgres process). This approach has a drawback when the client's hybrid clock shows a time in the future compared to the tserver's time, caused by clock skew. On receiving a read request with such a read time, the tserver will wait until the tablet's safe time has reached this future time, resulting in increased read latency.

To prevent the tserver from waiting while processing the read request, the read time should be omitted from the request. In this case the tserver will use the tablet's current safe time as the read time, and will return that time to the client (YSQL). The same read time should then be used by all other operations initiated by YSQL as part of the same transaction.

**Note:**
1. When the first read operation performs reads from different tablets, we detect this case and pick the read time on the client side. (If we allowed each tablet server to pick its own read time, the reads from different tablet servers would be inconsistent with each other.)
2. The client should not initiate parallel read operations at the beginning of a transaction. Even if each operation reads just one tablet, all these operations should use the same read time for consistency. After the read time has been picked, e.g. for the second operation of the transaction and beyond, parallel reads are fine. One case in which we send parallel operations in YSQL is foreign key checking.
3. In case of unexpected behavior, the new functionality can be disabled using the newly created gflag `force_preset_read_time_on_client`. Its default value is `false`, which enables the new behavior; set it to `true` to revert to the old behavior.
4. The fix in mainline is slightly different because of changes introduced by D13244 / c5f5125 and will be handled by https://phabricator.dev.yugabyte.com/D16201.

Original commit: D16345 / ba1504e

Test Plan:
Jenkins: rebase: 2.8
A new unit test is introduced:
```
./yb_build.sh --gtest_filter PgLibPqTest.NoReadRestartOnSingleTablet
```
Reviewers: mbautin, sergei, amitanand
Reviewed By: sergei, amitanand
Subscribers: yql, mbautin
Differential Revision: https://phabricator.dev.yugabyte.com/D16548

…sible

Summary: To achieve consistency of reads from multiple tablets and/or across multiple operations in the context of a single transaction, YSQL selects the read time on the internal client side (in the Postgres process). This approach has a drawback when the client's hybrid clock shows a time in the future compared to the tserver's time, caused by clock skew. On receiving a read request with such a read time, the tserver will wait until the tablet's safe time has reached this future time, resulting in increased read latency.

To prevent the tserver from waiting while processing the read request, the read time should be omitted from the request. In this case the tserver will use the tablet's current safe time as the read time, and will return that time to the client (YSQL). The same read time should then be used by all other operations initiated by YSQL as part of the same transaction.

**Note:**
1. When the first read operation performs reads from different tablets, we detect this case and pick the read time on the client side. (If we allowed each tablet server to pick its own read time, the reads from different tablet servers would be inconsistent with each other.)
2. The client should not initiate parallel read operations at the beginning of a transaction. Even if each operation reads just one tablet, all these operations should use the same read time for consistency. After the read time has been picked, e.g. for the second operation of the transaction and beyond, parallel reads are fine. One case in which we send parallel operations in YSQL is foreign key checking.
3. In case of unexpected behavior, the new functionality can be disabled using the newly created gflag `force_preset_read_time_on_client`. Its default value is `false`, which enables the new behavior; set it to `true` to revert to the old behavior.
4. The fix in mainline is slightly different because of changes introduced by D13244 / c5f5125 and will be handled by https://phabricator.dev.yugabyte.com/D16201.

Original commit: D16345 / ba1504e

Test Plan:
Jenkins: rebase: 2.6
A new unit test is introduced:
```
./yb_build.sh --gtest_filter PgLibPqTest.NoReadRestartOnSingleTablet
```
Reviewers: mbautin, sergei, amitanand
Reviewed By: sergei, amitanand
Subscribers: yql, mbautin
Differential Revision: https://phabricator.dev.yugabyte.com/D16547
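The commits above describe the fix in prose; the sketch below restates that flow in illustrative Python. The class and method names (`TabletServer`, `YsqlSession`, `execute_read`) are invented for this sketch, not the actual YugabyteDB API; only the behaviour follows the summary: a first read that touches a single tablet omits the read time, the tserver reads at its current safe time and returns it as `used_read_time`, the rest of the transaction reuses that time, and a first read spanning several tablets still picks the read time on the client side.

```python
import time

class TabletServer:
    """Stand-in for a tserver that tracks a tablet's safe time (illustrative only)."""
    def __init__(self):
        self.safe_time = time.time()

    def read(self, read_time=None):
        if read_time is None:
            # No read time in the request: read at the current safe time and
            # report it back to the client as used_read_time.
            read_time = self.safe_time
        else:
            # A client-picked read time may be ahead of safe time (clock skew,
            # in-flight writes); the tserver must wait for safe time to catch up.
            while self.safe_time < read_time:
                time.sleep(0.001)
                self.safe_time = max(self.safe_time, time.time())
        return {"rows": [], "used_read_time": read_time}

class YsqlSession:
    """Stand-in for the YSQL client side of a single transaction."""
    def __init__(self):
        self.read_time = None  # not picked yet

    def execute_read(self, tablets):
        if self.read_time is None:
            if len(tablets) == 1:
                # First read of the transaction, single tablet: omit the read
                # time and adopt whatever the tserver used (the fix above).
                responses = [tablets[0].read(read_time=None)]
                self.read_time = responses[0]["used_read_time"]
                return responses
            # First read spans several tablets: pick one read time on the
            # client so that all tablets return a consistent snapshot.
            self.read_time = time.time()
        # Read time already picked: every subsequent operation reuses it.
        return [t.read(read_time=self.read_time) for t in tablets]

# Usage: the first single-tablet read establishes the transaction's read time.
session, a, b = YsqlSession(), TabletServer(), TabletServer()
session.execute_read([a])       # tserver picks the read time, no waiting
session.execute_read([a, b])    # later reads reuse session.read_time
```

Per note 3 in the commit message, the real implementation keeps the old client-side behaviour reachable via the `force_preset_read_time_on_client` gflag.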
#11886 tracks the fix for this issue in the 2.13 / master branch.
…ble for Read Committed isolation

Summary: In Read Committed isolation, a new read time is picked for each statement (i.e., a new logical snapshot of the database is used for each statement's reads). This is done (in PgClientService) by setting the read time to the current time at the start of each new statement, before issuing requests to any tserver. However, this might result in high latency for the first read op executed as part of that statement, because the tablet serving the read (likely on another node) might have to wait for the "safe" time to reach the picked read time. A long wait for safe time is usually seen when there are concurrent writes to the tablet and the read arrives while the raft replication that moves the safe time ahead is still in progress (see yugabyte#11805).

This issue is avoided in Repeatable Read isolation because there, the first tablet serving a read in a transaction is allowed to pick the read time as the latest available "safe" time, without having to wait for any catchup. This read time is sent back to PgClientService as used_read_time so that future reads can use the same read time. Note that even in Repeatable Read isolation, if there are multiple parallel RPCs to various tservers, the read time is still picked in PgClientService, because otherwise the RPCs would have to wait for one of them to execute and come back with a used_read_time.

This diff extends the same logic to Read Committed isolation.

Test Plan:
Jenkins: skip
./yb_build.sh --java-test org.yb.pgsql.TestPgTransactions#testReadPointInReadCommittedIsolation
./yb_build.sh --java-test org.yb.pgsql.TestPgIsolationRegress
Reviewers: dmitry
Subscribers: yql, bogdan
Differential Revision: https://phabricator.dev.yugabyte.com/D24075

…n Read Committed isolation

Summary: In Read Committed isolation, a new read time is picked for each statement (i.e., a new logical snapshot of the database is used for each statement's reads). This is done (in PgClientService) by setting the read time to the current time at the start of each new statement, before issuing requests to any tserver. However, this might result in high latency for the first read op executed as part of that statement, because the tablet serving the read (likely on another node) might have to wait for the "safe" time to reach the picked read time. A long wait for safe time is usually seen when there are concurrent writes to the tablet and the read arrives while the raft replication that moves the safe time ahead is still in progress (see yugabyte#11805).

This issue is avoided in Repeatable Read isolation because there, the first tablet serving a read in a transaction is allowed to pick the read time as the latest available "safe" time, without having to wait for any catchup. This read time is sent back to PgClientService as used_read_time so that future reads can use the same read time. Note that even in Repeatable Read isolation, if there are multiple parallel RPCs to various tservers, the read time is still picked in PgClientService, because otherwise the RPCs would have to wait for one of them to execute and come back with a used_read_time.

This diff extends the same logic to Read Committed isolation.

Test Plan:
./yb_build.sh --java-test org.yb.pgsql.TestPgTransactions#testReadPointInReadCommittedIsolation
./yb_build.sh --java-test org.yb.pgsql.TestPgIsolationRegress
Reviewers: dmitry
Subscribers: yql, bogdan
Differential Revision: https://phabricator.dev.yugabyte.com/D24075
…ommitted isolation

Summary: In Read Committed isolation, a new read time is picked for each statement (i.e., a new logical snapshot of the database is used for each statement's reads). This is done (in PgClientService) by setting the read time to the current time at the start of each new statement, before issuing requests to any tserver. However, this might result in high latency for the first read op executed as part of that statement, because the tablet serving the read (likely on another node) might have to wait for the "safe" time to reach the picked read time. A long wait for safe time is usually seen when there are concurrent writes to the tablet and the read arrives while the raft replication that moves the safe time ahead is still in progress (see #11805).

This issue is avoided in Repeatable Read isolation because there, the first tablet serving a read in a transaction is allowed to pick the read time as the latest available "safe" time, without having to wait for any catchup. This read time is sent back to PgClientService as used_read_time so that future reads can use the same read time. Note that even in Repeatable Read isolation, if there are multiple parallel RPCs to various tservers, the read time is still picked in PgClientService, because otherwise the RPCs would have to wait for one of them to execute and come back with a used_read_time.

This diff extends the same logic to Read Committed isolation.

Jira: DB-5248

Test Plan:
./yb_build.sh --java-test org.yb.pgsql.TestPgTransactions#testReadPointInReadCommittedIsolation
./yb_build.sh --java-test org.yb.pgsql.TestPgIsolationRegress
Reviewers: dmitry
Reviewed By: dmitry
Subscribers: dsrinivasan, gkukreja, yql, bogdan
Differential Revision: https://phorge.dev.yugabyte.com/D24075
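As a companion to the sketch earlier in the thread, the following illustrative Python (again with invented names, not the actual PgClientService API) contrasts how the read point behaves in the two isolation levels described above: Repeatable Read keeps one read time for the whole transaction, Read Committed resets it at the start of each statement, and in both cases a first read issued as a single RPC may defer the choice to the tserver's safe time.

```python
class Transaction:
    """Illustrative model of read-point handling per isolation level."""
    def __init__(self, isolation):
        self.isolation = isolation   # "repeatable read" or "read committed"
        self.read_time = None        # not picked yet

    def begin_statement(self):
        if self.isolation == "read committed":
            # Read Committed: every statement reads from a fresh logical
            # snapshot, so the read point is cleared and re-picked.
            self.read_time = None
        # Repeatable Read: keep the read time picked by the first read of
        # the transaction, so there is nothing to reset here.

    def pick_read_time(self, num_parallel_rpcs, tserver_safe_time, client_now):
        if self.read_time is not None:
            return self.read_time            # already picked: reuse it
        if num_parallel_rpcs == 1:
            # Single RPC: defer to the tserver's current safe time, so the
            # read never waits for safe time to catch up (the optimisation
            # extended to Read Committed by the commits above).
            self.read_time = tserver_safe_time
        else:
            # Parallel RPCs: the client must pick one read time up front so
            # that all tservers return a consistent snapshot.
            self.read_time = client_now
        return self.read_time

# Usage: Read Committed re-picks the read time each statement; Repeatable
# Read keeps the first pick for the rest of the transaction.
rc = Transaction("read committed")
rc.begin_statement(); print(rc.pick_read_time(1, tserver_safe_time=90, client_now=100))  # 90
rc.begin_statement(); print(rc.pick_read_time(1, tserver_safe_time=95, client_now=105))  # 95

rr = Transaction("repeatable read")
rr.begin_statement(); print(rr.pick_read_time(1, tserver_safe_time=90, client_now=100))  # 90
rr.begin_statement(); print(rr.pick_read_time(2, tserver_safe_time=95, client_now=105))  # 90 (reused)
```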
Description
In the presence of writer threads, simple reads in the SqlInserts workload (even without indexes) take over 10ms in an RF-3 setting. Once the writer threads complete their job, read latency comes down to 2-3ms.
Repro: Set up an RF-3 cluster on a Mac, then run SqlInserts with 6 write threads [1] and 24 write threads [2]. Notice the spike in read latencies from about 2-3ms to 9-10ms.
[1]
[2]
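A toy simulation (illustrative Python, not a benchmark of the real system) of the behaviour reported above, under the assumption that while writers are active the tablet's safe time trails the current time by up to the Raft replication latency, so a read that insists on a client-picked read time of "now" must wait for safe time to catch up. The latency constants are made-up round numbers chosen to echo the 2-3ms versus ~10ms observation.

```python
import random

REPLICATION_MS = 8.0   # assumed lag of safe time behind "now" while writes are in flight
BASE_READ_MS = 2.0     # assumed cost of the read itself

def simulate(num_reads, writers_active, client_picks_read_time):
    latencies = []
    for _ in range(num_reads):
        # How far safe time trails "now" when the read arrives.
        lag = random.uniform(0.0, REPLICATION_MS) if writers_active else random.uniform(0.0, 0.5)
        # A client-picked read time of "now" has to wait out the lag;
        # a tserver-picked read time (current safe time) never waits.
        wait = lag if client_picks_read_time else 0.0
        latencies.append(BASE_READ_MS + wait)
    latencies.sort()
    return latencies[len(latencies) // 2], latencies[int(len(latencies) * 0.99)]

random.seed(0)
for writers, client_picks in [(True, True), (False, True), (True, False)]:
    p50, p99 = simulate(10_000, writers, client_picks)
    print(f"writers_active={writers} client_picks_read_time={client_picks} "
          f"p50={p50:.1f}ms p99={p99:.1f}ms")
```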