Performance drops over time for UPDATE to same row #1240
thx @jeffreims for reporting this issue, and for providing the test case as well. We were able to reproduce the issue, and the problem seems specific to a row being updated many times, where we have intent records (from previous mutations to the row) that aren't being garbage collected aggressively. We are working on an optimization to speed up this case, and will keep you posted. Meanwhile, one question: will it be a common workload pattern in your use case for the same row to undergo thousands of updates within a short span of time? Or was it more symptomatic of the way the test program was written? If, in practice, the updates are spread over many rows and updates to the same row are spread out over time, then you should not really see this issue, because garbage collection of older intent records will keep happening periodically.
We normally will not be making that many updates within a short time on a single row.
thx for the additional data point @jeffreims. Glad to hear this won't be a common pattern for you in the near term. Nevertheless, we'll work on optimizing this case.
Summary: RocksDB contains two implementations of skip list. `InlineSkipList`, which we've been using prior to this revision, allows concurrent writes and stores keys of variable length in Node. `SkipList` allows concurrent reads but only one writer at a time, so it should really be called `SingleWriterSkipList`. Since we write to RocksDB only from a single thread, we could benefit from using `SkipList`. Also, it is easier to implement erase in such a list. This diff adapts the interface of `SkipList` (the single-writer skip list) so it can be used by SkipListRep, and adds the ability to store keys of variable length in Node.

Test Plan: Jenkins

Reviewers: timur, mikhail

Reviewed By: mikhail

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D6595
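As a rough illustration of why a single-writer skip list makes erase straightforward, here is a minimal sketch of such a structure. This is not the actual RocksDB `SkipList`/`InlineSkipList` code — the class and method names below are made up for illustration — it only shows the core idea that with exactly one writer, inserts and erases can splice plain pointers without CAS loops.

```cpp
// Minimal single-writer skip list sketch (illustrative only; not the RocksDB
// SkipList/InlineSkipList code). With one writer, Insert and Erase splice
// plain pointers, and Erase is simple: unlink the node on every level.
#include <cstdlib>
#include <iostream>
#include <string>
#include <vector>

class SingleWriterSkipList {
  static constexpr int kMaxHeight = 12;

  struct Node {
    std::string key;
    std::vector<Node*> next;  // next[i] is the successor at level i
    Node(std::string k, int height) : key(std::move(k)), next(height, nullptr) {}
  };

  Node head_{"", kMaxHeight};
  int height_ = 1;

  static int RandomHeight() {
    int h = 1;
    while (h < kMaxHeight && (std::rand() % 4) == 0) ++h;  // 1/4 branching factor
    return h;
  }

  // Fill prev[i] with the last node whose key is < `key` at each level.
  void FindPredecessors(const std::string& key, Node* prev[kMaxHeight]) {
    Node* x = &head_;
    for (int level = height_ - 1; level >= 0; --level) {
      while (x->next[level] != nullptr && x->next[level]->key < key) {
        x = x->next[level];
      }
      prev[level] = x;
    }
  }

 public:
  void Insert(const std::string& key) {
    Node* prev[kMaxHeight];
    FindPredecessors(key, prev);
    const int h = RandomHeight();
    for (int i = height_; i < h; ++i) prev[i] = &head_;
    if (h > height_) height_ = h;
    Node* n = new Node(key, h);
    for (int i = 0; i < h; ++i) {
      n->next[i] = prev[i]->next[i];
      prev[i]->next[i] = n;  // plain pointer store: no CAS needed with one writer
    }
  }

  // Physically remove the key; easy to support when there is a single writer.
  bool Erase(const std::string& key) {
    Node* prev[kMaxHeight];
    FindPredecessors(key, prev);
    Node* target = prev[0]->next[0];
    if (target == nullptr || target->key != key) return false;
    for (int i = 0; i < height_; ++i) {
      if (prev[i]->next[i] == target) prev[i]->next[i] = target->next[i];
    }
    delete target;
    return true;
  }

  bool Contains(const std::string& key) {
    Node* prev[kMaxHeight];
    FindPredecessors(key, prev);
    Node* x = prev[0]->next[0];
    return x != nullptr && x->key == key;
  }
};

int main() {
  SingleWriterSkipList list;
  list.Insert("txn1/intent_a");
  list.Insert("txn1/intent_b");
  list.Erase("txn1/intent_a");  // removed in place, no tombstone needed
  std::cout << list.Contains("txn1/intent_a") << " "
            << list.Contains("txn1/intent_b") << std::endl;  // prints: 0 1
}
```

With a single writer, removal is just unlinking the node on every level, which is what makes in-place erase practical; a concurrent-write structure like `InlineSkipList` would additionally have to coordinate with other writers.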
This can be fixed by changing how the cache works.
hi @ddorian -- there is a fix for this in the works, currently under code review. At a high level, it has to do with more aggressively garbage collecting provisional/intent records for distributed transactions. Will keep you posted. We also plan to move our code review (Phabricator tool) into the open soon, so in the future we will be able to reference/point to the work-in-progress changes easily.
Summary: When a YugaByte DB transaction writes data, it does so in two phases:
1) Intents are written to the intents RocksDB, showing that the transaction is planning to write the corresponding values when it commits.
2) After the transaction is committed, those intents are converted to values in the regular RocksDB and deleted from the intents RocksDB of the tablet.

But RocksDB does not simply delete the original record from memory. Instead, it adds a "delete mark" for the record, so both the original record and the delete mark remain in the memtable. When a flush happens, RocksDB can drop the original record but has to keep the delete mark, because it does not "know" that the record is not present in earlier SSTables. Under the default RocksDB logic, such delete marks can only be removed during a major compaction. While these delete marks are never seen by the user, they are still fetched while looking up the corresponding key. That can significantly decrease performance, especially in a "hot row" use case where the same row keeps getting overwritten.

This diff adds a new option to RocksDB, `in_memory_erase`, which is used for the intents RocksDB. When it is set to true, on deletion RocksDB tries to erase the key from the memtable directly; if that succeeds, the delete mark is not written at all. There are the following restrictions:
1) All keys should be unique, because otherwise a previous value of the key could be present in an SSTable file and would not be deleted. This already holds for our RocksDBs because we always use key suffixes based on hybrid time.
2) With in-memory erase enabled, a RocksDB iterator can no longer read a point-in-time snapshot of the data (which is normally based on RocksDB sequence numbers), because key/value pairs belonging to such a snapshot could be removed in parallel. For the intents DB this is manageable because we only delete intents for already applied or aborted transactions, so these intents can't participate in read results anyway. However, to make this work, we have to delay the deletion of intents until the transaction is removed from the transaction participant, which is another change in this diff.

The following issues and improvements were needed to make tests pass:
1) Store recently removed transactions; this helps avoid:
   - Attempting to load their metadata from RocksDB.
   - Adding intents for a recently removed transaction, since they would be cleaned up much later.
2) Correctly fill `matadata_state` in the client part of the transaction when read operations are being prepared.
3) Ignore serial_no when notifying a status waiter about an aborted transaction.

Test Plan: ybd --cxx-test snapshot-txn-test --gtest_filter SnapshotTxnTest.HotRow

Reviewers: timur, alex, mikhail

Reviewed By: mikhail

Subscribers: bogdan, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D6860
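For intuition only, here is a tiny sketch of the delete path the commit describes: try an in-place memtable erase first, and only fall back to writing a delete mark if the key is not there. The types and method names (`FakeMemTable`, `FakeIntentsDb`, `TryErase`) are hypothetical stand-ins, not the real RocksDB or YugaByte DB APIs; only the idea and the option name `in_memory_erase` come from the commit message above.

```cpp
// Purely illustrative sketch of the in_memory_erase fast path; all names here
// are hypothetical and do not correspond to real RocksDB/YugaByte DB classes.
#include <iostream>
#include <map>
#include <string>

struct FakeMemTable {
  std::map<std::string, std::string> data;  // stand-in for the real memtable

  void Put(const std::string& key, const std::string& value) { data[key] = value; }

  // Returns true if the key was found and physically removed.
  bool TryErase(const std::string& key) { return data.erase(key) > 0; }
};

struct FakeIntentsDb {
  FakeMemTable memtable;
  bool in_memory_erase = true;  // conceptually, the option added by this diff

  void Delete(const std::string& key) {
    if (in_memory_erase && memtable.TryErase(key)) {
      // Fast path: the key is erased in place and no delete mark is written,
      // so repeated writes/deletes of a hot key never pile up tombstones.
      return;
    }
    // Default RocksDB-style behavior: append a "delete mark" that coexists
    // with any older versions of the key until compaction drops it.
    memtable.Put(key, "<delete mark>");
  }
};

int main() {
  FakeIntentsDb intents;
  intents.memtable.Put("txn1/row42", "provisional value");
  intents.Delete("txn1/row42");          // erased in place, nothing left behind
  intents.Delete("txn1/never_written");  // falls back to writing a delete mark
  std::cout << intents.memtable.data.size() << std::endl;  // prints: 1
}
```

The point of the fast path is that a hot key which is repeatedly written and deleted never accumulates delete marks in the memtable, so later lookups do not have to skip over them.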
I noticed performance drops when updating a single row many times in a table.
The performance drop seems to increase the more the row is updated, and it affects both updating and selecting this row.
This is the performance of a SELECT on a row that has been updated many times (>20k times):
Now inserting a new row in the same table: