[YSQL] Read latencies spike significantly (over 10ms) when there is a large concurrent write workload in the system. #11805
Comments
@spolitov wrote:
In multi-region setups, the round trip between nodes needed to replicate a write request at hybrid time ht1 can be high, for example about 10ms between the AWS east-1 and east-2 regions, and this can add roughly 10ms of extra read latency. For single-shard SELECTs we could optimize this by letting the TServer pick the read time rather than the YSQL layer.
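To make the arithmetic in the comment above concrete, here is a minimal sketch (illustrative Python, not YugabyteDB code) of the wait a tserver incurs when the requested read time is ahead of the tablet's safe time. The ~10ms round-trip figure is the assumption quoted in the comment, and the function name is hypothetical.

```python
def tserver_read_wait_ms(requested_read_time_ms, tablet_safe_time_ms):
    """Time the tserver must block before serving a read at the requested
    read time: it cannot read until safe time catches up (illustrative only)."""
    return max(0.0, requested_read_time_ms - tablet_safe_time_ms)

# While a write is being replicated across regions, the tablet's safe time can
# trail "now" by roughly one inter-region round trip (~10 ms in the example).
now_ms = 0.0
safe_time_ms = now_ms - 10.0

# YSQL picks read time = now on the client side: the read waits ~10 ms.
print(tserver_read_wait_ms(now_ms, safe_time_ms))        # 10.0

# The tserver picks the read time as its current safe time: no wait at all.
print(tserver_read_wait_ms(safe_time_ms, safe_time_ms))  # 0.0
```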
Assuming the above, this would yield a huge win in multi-region scenarios, so it is probably worth considering.
Reran the CassandraKeyValue and SqlInserts workloads against an RF-3 single-zone cluster.
java -jar yb-sample-apps.jar --workload CassandraKeyValue --nodes 172.150.31.201:9042 --num_threads_write 8
java -jar yb-sample-apps.jar --workload CassandraKeyValue --nodes 172.150.31.201:9042 --num_threads_write 32
java -jar yb-sample-apps.jar --workload CassandraKeyValue --nodes 172.150.31.201:9042 --num_threads_write 64
java -jar yb-sample-apps.jar --workload SqlInserts --nodes 172.150.31.201:5433 --num_threads_write 8
java -jar yb-sample-apps.jar --workload SqlInserts --nodes 172.150.31.201:5433 --num_threads_write 32
java -jar yb-sample-apps.jar --workload SqlInserts --nodes 172.150.31.201:5433 --num_threads_write 64
@amitanandaiyer, can you take a stab at it?
#11886 should fix the single-tablet latency issues.
…ible

Summary: To achieve consistency of reads from multiple tablets and/or across multiple operations in the context of a single transaction, YSQL selects the read time on the internal client side (in the Postgres process). This approach has a drawback when the client's hybrid clock shows a time in the future compared to the tserver's time, caused by clock skew. On receiving a read request with such a read time, the tserver will wait until the tablet's safe time has reached this future time, resulting in increased read latency.

To prevent the tserver from waiting while processing the read request, the read time should be omitted from the request. In this case the tserver will use the tablet's current safe time as the read time, and will return that time to the client (YSQL). The same read time should then be used by all other operations initiated by YSQL as part of the same transaction.

**Note:**
1. When the first read operation performs reads from different tablets, we detect this case and pick the read time on the client side. (If we allowed each tablet server to pick its own read time, the reads from different tablet servers would be inconsistent with each other.)
2. The client should not initiate parallel read operations at the beginning of a transaction. Even if each operation reads just one tablet, all these operations should use the same read time for consistency. After the read time has been picked, e.g. for the second operation of the transaction and beyond, parallel reads are fine. One case in which we send parallel operations in YSQL is foreign key checking.
3. In case of unexpected behavior, the new functionality can be disabled using the newly created gflag `force_preset_read_time_on_client`. Its default value is `false`, which enables the new behavior; set it to `true` to revert to the old behavior.
4. The fix in mainline is slightly different because of changes introduced by D13244 / c5f5125 and will be handled by https://phabricator.dev.yugabyte.com/D16201.

Test Plan:
Jenkins: rebase: 2.12
A new unit test is introduced:
```
./yb_build.sh --gtest_filter PgLibPqTest.NoReadRestartOnSingleTablet
```
Reviewers: sergei, mbautin, amitanand
Reviewed By: sergei, mbautin, amitanand
Subscribers: mbautin, yql
Differential Revision: https://phabricator.dev.yugabyte.com/D16345

…sible

Summary: To achieve consistency of reads from multiple tablets and/or across multiple operations in the context of a single transaction, YSQL selects the read time on the internal client side (in the Postgres process). This approach has a drawback when the client's hybrid clock shows a time in the future compared to the tserver's time, caused by clock skew. On receiving a read request with such a read time, the tserver will wait until the tablet's safe time has reached this future time, resulting in increased read latency.

To prevent the tserver from waiting while processing the read request, the read time should be omitted from the request. In this case the tserver will use the tablet's current safe time as the read time, and will return that time to the client (YSQL). The same read time should then be used by all other operations initiated by YSQL as part of the same transaction.

**Note:**
1. When the first read operation performs reads from different tablets, we detect this case and pick the read time on the client side. (If we allowed each tablet server to pick its own read time, the reads from different tablet servers would be inconsistent with each other.)
2. The client should not initiate parallel read operations at the beginning of a transaction. Even if each operation reads just one tablet, all these operations should use the same read time for consistency. After the read time has been picked, e.g. for the second operation of the transaction and beyond, parallel reads are fine. One case in which we send parallel operations in YSQL is foreign key checking.
3. In case of unexpected behavior, the new functionality can be disabled using the newly created gflag `force_preset_read_time_on_client`. Its default value is `false`, which enables the new behavior; set it to `true` to revert to the old behavior.
4. The fix in mainline is slightly different because of changes introduced by D13244 / c5f5125 and will be handled by https://phabricator.dev.yugabyte.com/D16201.

Original commit: D16345 / ba1504e

Test Plan:
Jenkins: rebase: 2.8
A new unit test is introduced:
```
./yb_build.sh --gtest_filter PgLibPqTest.NoReadRestartOnSingleTablet
```
Reviewers: mbautin, sergei, amitanand
Reviewed By: sergei, amitanand
Subscribers: yql, mbautin
Differential Revision: https://phabricator.dev.yugabyte.com/D16548

…sible

Summary: To achieve consistency of reads from multiple tablets and/or across multiple operations in the context of a single transaction, YSQL selects the read time on the internal client side (in the Postgres process). This approach has a drawback when the client's hybrid clock shows a time in the future compared to the tserver's time, caused by clock skew. On receiving a read request with such a read time, the tserver will wait until the tablet's safe time has reached this future time, resulting in increased read latency.

To prevent the tserver from waiting while processing the read request, the read time should be omitted from the request. In this case the tserver will use the tablet's current safe time as the read time, and will return that time to the client (YSQL). The same read time should then be used by all other operations initiated by YSQL as part of the same transaction.

**Note:**
1. When the first read operation performs reads from different tablets, we detect this case and pick the read time on the client side. (If we allowed each tablet server to pick its own read time, the reads from different tablet servers would be inconsistent with each other.)
2. The client should not initiate parallel read operations at the beginning of a transaction. Even if each operation reads just one tablet, all these operations should use the same read time for consistency. After the read time has been picked, e.g. for the second operation of the transaction and beyond, parallel reads are fine. One case in which we send parallel operations in YSQL is foreign key checking.
3. In case of unexpected behavior, the new functionality can be disabled using the newly created gflag `force_preset_read_time_on_client`. Its default value is `false`, which enables the new behavior; set it to `true` to revert to the old behavior.
4. The fix in mainline is slightly different because of changes introduced by D13244 / c5f5125 and will be handled by https://phabricator.dev.yugabyte.com/D16201.

Original commit: D16345 / ba1504e

Test Plan:
Jenkins: rebase: 2.6
A new unit test is introduced:
```
./yb_build.sh --gtest_filter PgLibPqTest.NoReadRestartOnSingleTablet
```
Reviewers: mbautin, sergei, amitanand
Reviewed By: sergei, amitanand
Subscribers: yql, mbautin
Differential Revision: https://phabricator.dev.yugabyte.com/D16547
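The commits above describe the fix in prose; the sketch below restates that flow in illustrative Python. The class and method names (`TabletServer`, `YsqlSession`, `execute_read`) are invented for this sketch, not the actual YugabyteDB API; only the behaviour follows the summary: a first read that touches a single tablet omits the read time, the tserver reads at its current safe time and returns it as `used_read_time`, the rest of the transaction reuses that time, and a first read spanning several tablets still picks the read time on the client side.

```python
import time

class TabletServer:
    """Stand-in for a tserver that tracks a tablet's safe time (illustrative only)."""
    def __init__(self):
        self.safe_time = time.time()

    def read(self, read_time=None):
        if read_time is None:
            # No read time in the request: read at the current safe time and
            # report it back to the client as used_read_time.
            read_time = self.safe_time
        else:
            # A client-picked read time may be ahead of safe time (clock skew,
            # in-flight writes); the tserver must wait for safe time to catch up.
            while self.safe_time < read_time:
                time.sleep(0.001)
                self.safe_time = max(self.safe_time, time.time())
        return {"rows": [], "used_read_time": read_time}

class YsqlSession:
    """Stand-in for the YSQL client side of a single transaction."""
    def __init__(self):
        self.read_time = None  # not picked yet

    def execute_read(self, tablets):
        if self.read_time is None:
            if len(tablets) == 1:
                # First read of the transaction, single tablet: omit the read
                # time and adopt whatever the tserver used (the fix above).
                responses = [tablets[0].read(read_time=None)]
                self.read_time = responses[0]["used_read_time"]
                return responses
            # First read spans several tablets: pick one read time on the
            # client so that all tablets return a consistent snapshot.
            self.read_time = time.time()
        # Read time already picked: every subsequent operation reuses it.
        return [t.read(read_time=self.read_time) for t in tablets]

# Usage: the first single-tablet read establishes the transaction's read time.
session, a, b = YsqlSession(), TabletServer(), TabletServer()
session.execute_read([a])       # tserver picks the read time, no waiting
session.execute_read([a, b])    # later reads reuse session.read_time
```

Per note 3 in the commit message, the real implementation keeps the old client-side behaviour reachable via the `force_preset_read_time_on_client` gflag.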
#11886 tracks the fix for this issue in the 2.13 / master branch.
…ble for Read Committed isolation

Summary: In Read Committed isolation, a new read time is picked for each statement (i.e., a new logical snapshot of the database is used for each statement's reads). This is done (in PgClientService) by setting the read time to the current time at the start of each new statement, before issuing requests to any tserver. However, this might result in high latency for the first read op executed as part of that statement, because the tablet serving the read (likely on another node) might have to wait for the "safe" time to reach the picked read time. A long wait for safe time is usually seen when there are concurrent writes to the tablet and the read arrives while the raft replication that moves the safe time ahead is still in progress (see yugabyte#11805).

This issue is avoided in Repeatable Read isolation because there, the first tablet serving a read in a transaction is allowed to pick the read time as the latest available "safe" time, without having to wait for any catchup. This read time is sent back to PgClientService as used_read_time so that future reads can use the same read time. Note that even in Repeatable Read isolation, if there are multiple parallel RPCs to various tservers, the read time is still picked in PgClientService, because otherwise the RPCs would have to wait for one of them to execute and come back with a used_read_time.

This diff extends the same logic to Read Committed isolation.

Test Plan:
Jenkins: skip
./yb_build.sh --java-test org.yb.pgsql.TestPgTransactions#testReadPointInReadCommittedIsolation
./yb_build.sh --java-test org.yb.pgsql.TestPgIsolationRegress
Reviewers: dmitry
Subscribers: yql, bogdan
Differential Revision: https://phabricator.dev.yugabyte.com/D24075

…n Read Committed isolation

Summary: In Read Committed isolation, a new read time is picked for each statement (i.e., a new logical snapshot of the database is used for each statement's reads). This is done (in PgClientService) by setting the read time to the current time at the start of each new statement, before issuing requests to any tserver. However, this might result in high latency for the first read op executed as part of that statement, because the tablet serving the read (likely on another node) might have to wait for the "safe" time to reach the picked read time. A long wait for safe time is usually seen when there are concurrent writes to the tablet and the read arrives while the raft replication that moves the safe time ahead is still in progress (see yugabyte#11805).

This issue is avoided in Repeatable Read isolation because there, the first tablet serving a read in a transaction is allowed to pick the read time as the latest available "safe" time, without having to wait for any catchup. This read time is sent back to PgClientService as used_read_time so that future reads can use the same read time. Note that even in Repeatable Read isolation, if there are multiple parallel RPCs to various tservers, the read time is still picked in PgClientService, because otherwise the RPCs would have to wait for one of them to execute and come back with a used_read_time.

This diff extends the same logic to Read Committed isolation.

Test Plan:
./yb_build.sh --java-test org.yb.pgsql.TestPgTransactions#testReadPointInReadCommittedIsolation
./yb_build.sh --java-test org.yb.pgsql.TestPgIsolationRegress
Reviewers: dmitry
Subscribers: yql, bogdan
Differential Revision: https://phabricator.dev.yugabyte.com/D24075
…ommitted isolation

Summary: In Read Committed isolation, a new read time is picked for each statement (i.e., a new logical snapshot of the database is used for each statement's reads). This is done (in PgClientService) by setting the read time to the current time at the start of each new statement, before issuing requests to any tserver. However, this might result in high latency for the first read op executed as part of that statement, because the tablet serving the read (likely on another node) might have to wait for the "safe" time to reach the picked read time. A long wait for safe time is usually seen when there are concurrent writes to the tablet and the read arrives while the raft replication that moves the safe time ahead is still in progress (see #11805).

This issue is avoided in Repeatable Read isolation because there, the first tablet serving a read in a transaction is allowed to pick the read time as the latest available "safe" time, without having to wait for any catchup. This read time is sent back to PgClientService as used_read_time so that future reads can use the same read time. Note that even in Repeatable Read isolation, if there are multiple parallel RPCs to various tservers, the read time is still picked in PgClientService, because otherwise the RPCs would have to wait for one of them to execute and come back with a used_read_time.

This diff extends the same logic to Read Committed isolation.

Jira: DB-5248

Test Plan:
./yb_build.sh --java-test org.yb.pgsql.TestPgTransactions#testReadPointInReadCommittedIsolation
./yb_build.sh --java-test org.yb.pgsql.TestPgIsolationRegress
Reviewers: dmitry
Reviewed By: dmitry
Subscribers: dsrinivasan, gkukreja, yql, bogdan
Differential Revision: https://phorge.dev.yugabyte.com/D24075
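As a companion to the sketch earlier in the thread, the following illustrative Python (again with invented names, not the actual PgClientService API) contrasts how the read point behaves in the two isolation levels described above: Repeatable Read keeps one read time for the whole transaction, Read Committed resets it at the start of each statement, and in both cases a first read issued as a single RPC may defer the choice to the tserver's safe time.

```python
class Transaction:
    """Illustrative model of read-point handling per isolation level."""
    def __init__(self, isolation):
        self.isolation = isolation   # "repeatable read" or "read committed"
        self.read_time = None        # not picked yet

    def begin_statement(self):
        if self.isolation == "read committed":
            # Read Committed: every statement reads from a fresh logical
            # snapshot, so the read point is cleared and re-picked.
            self.read_time = None
        # Repeatable Read: keep the read time picked by the first read of
        # the transaction, so there is nothing to reset here.

    def pick_read_time(self, num_parallel_rpcs, tserver_safe_time, client_now):
        if self.read_time is not None:
            return self.read_time            # already picked: reuse it
        if num_parallel_rpcs == 1:
            # Single RPC: defer to the tserver's current safe time, so the
            # read never waits for safe time to catch up (the optimisation
            # extended to Read Committed by the commits above).
            self.read_time = tserver_safe_time
        else:
            # Parallel RPCs: the client must pick one read time up front so
            # that all tservers return a consistent snapshot.
            self.read_time = client_now
        return self.read_time

# Usage: Read Committed re-picks the read time each statement; Repeatable
# Read keeps the first pick for the rest of the transaction.
rc = Transaction("read committed")
rc.begin_statement(); print(rc.pick_read_time(1, tserver_safe_time=90, client_now=100))  # 90
rc.begin_statement(); print(rc.pick_read_time(1, tserver_safe_time=95, client_now=105))  # 95

rr = Transaction("repeatable read")
rr.begin_statement(); print(rr.pick_read_time(1, tserver_safe_time=90, client_now=100))  # 90
rr.begin_statement(); print(rr.pick_read_time(2, tserver_safe_time=95, client_now=105))  # 90 (reused)
```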
Description
In the presence of writer threads, simple reads in the SqlInserts workload (even without indexes) take over 10ms in an RF-3 setting. Once the writer threads complete their job, read latency comes down to 2-3ms.
Repro: Set up an RF-3 cluster on a Mac, then run SqlInserts with 6 write threads [1] and 24 write threads [2]. Notice the spike in read latencies from about 2-3ms to 9-10ms.
[1]
[2]
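A toy simulation (illustrative Python, not a benchmark of the real system) of the behaviour reported above, under the assumption that while writers are active the tablet's safe time trails the current time by up to the Raft replication latency, so a read that insists on a client-picked read time of "now" must wait for safe time to catch up. The latency constants are made-up round numbers chosen to echo the 2-3ms versus ~10ms observation.

```python
import random

REPLICATION_MS = 8.0   # assumed lag of safe time behind "now" while writes are in flight
BASE_READ_MS = 2.0     # assumed cost of the read itself

def simulate(num_reads, writers_active, client_picks_read_time):
    latencies = []
    for _ in range(num_reads):
        # How far safe time trails "now" when the read arrives.
        lag = random.uniform(0.0, REPLICATION_MS) if writers_active else random.uniform(0.0, 0.5)
        # A client-picked read time of "now" has to wait out the lag;
        # a tserver-picked read time (current safe time) never waits.
        wait = lag if client_picks_read_time else 0.0
        latencies.append(BASE_READ_MS + wait)
    latencies.sort()
    return latencies[len(latencies) // 2], latencies[int(len(latencies) * 0.99)]

random.seed(0)
for writers, client_picks in [(True, True), (False, True), (True, False)]:
    p50, p99 = simulate(10_000, writers, client_picks)
    print(f"writers_active={writers} client_picks_read_time={client_picks} "
          f"p50={p50:.1f}ms p99={p99:.1f}ms")
```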