-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[docdb] During bootstrap skip flushed entries could skip log segment that contain committed operation #3759
Labels
Comments
This looks like the same thing we hit AFTER the initial crash in #3413 (comment) cc @rahuldesirazu |
mbautin
added a commit
that referenced
this issue
Jul 21, 2020
… incorrect entries on tablet bootstrap Summary: Fixing various bugs in tablet bootstrap. #5003: WAL tail is not being overwritten correctly by a new leader ================================================================== We can get into a situation when part of the WAL with indexes higher than the last index flushed into RocksDB stays in pending_replicates by the end of tablet bootstrap. Appending a future entry will fail in this case, as happens in #5003. This bug happens because we don't process WAL tail overwrites caused by entries whose index is less than or equal to the last flushed index. The solution is to always process these overwrites, even if the newly added entry's index has no exact match in pending_replicates, and remove entries with higher indexes. Treating entries that were never committed as committed ======================================================= Ppart of tablet bootstrap logic that is supposed to only run on committed entries could sometimes run on entries that were never committed. This was possible because we still had an old way to recover the comitted OpId from using a Raft index-only match of flushed OpId to the entry being processed. This is a leftover from the time when we were setting RocksDB sequence numbers to Raft indexes. E.g. suppose entry 2.10 is flushed into RocksDB. When we encounter entry 1.10 during bootstrap, we would consider it committed, even though it might have been overwritten later by entry 2.5, because its index 10 matches that of the flushed OpId. We probably would not apply such potentially overwritten entries to RocksDB anyway as its index is too low and we have more checks later in the process. However, this outdated logic is not needed, because we have the "user frontier" / "consensus frontier" mechanism to keep track of last OpIds flushed in to RocksDBs. Not registering flushed operations with RetryableRequests ========================================================= RetryableRequests is a subsystem introduced in 023c20a for ignoring duplicate write requests to a tablet. It is important for consistency, and ensures that retries at a higher layer do not cause multiple writes. In tablet bootstrap, we need to register committed operations that happened during last FLAGS_retryable_request_timeout_secs seconds (120 seconds by default) with RetryableRequests, whether or not those operations are flushed. (Uncommitted operations are supposed to be passed to the consensus layer as "orphaned replicates" and be registered with RetryableRequests there.) Prior to this diff, operations with index lower than the flushed index were not register with RetryableRequests for transactional tables. #3759: potentially skipping too many segments with skip_wal_rewrite =================================================================== We were starting replay from the last segment where the committed OpId of the first entry was less than or equal to the flushed OpId.Suppose segment 1 has entries 1.1 to 1.10 and nothing is committed, and segment 2 has entries 1.11 and 1.20, and only entry 1.2 is committed. Flushed OpId is 1.5. Then we might look at segment 2, find that 1.2 (committed OpId) <= 1.5 (flushed OpId), and incorrectly skip segment 1, even though it contains entries we need. Instead of using committed OpId, we need to be selecting the first segment to replay as the last segment where the actual OpId of the first entry is less than or equal to the flushed OpId. #4983: only create a new log segment after successful bootstrap =============================================================== In the skip_wal_rewrite mode we are always creating a new log segment before starting log replay. Then log replay picks up the new segment, finds a "corruption" in it, and has to skip it. If a tablet server gets into a restart loop due to tablet bootstrap failure, hundreds of new empty log segments may get created quickly. It is much cleaner to only create a new log segment after successful tablet bootstrap in the skip_wal_rewrite mode (which is now the default mode). Also as part of doing this, I simplified the ReadFirstEntryMetadata function by delegating to the function ReadEntries with an additional parameter to just read one entry. Refactoring =========== Reorganizing the tablet bootstrap code to fully hide the TabletBootstrap class in the implementation file. Other improvements ================== Adding new test macros ASSERT_SETS_EQ and ASSERT_VECTORS_EQ. ASSERT_VECTORS_EQ reports differences between element sets of two vectors in case they do not match, in addition to regular output, making it easier to spot differences in many cases. Documenting intended usage for YB_STRUCT_TO_STRING and YB_CLASS_TO_STRING macros. Test Plan: Jenkins A new randomized tablet bootstrap test creating a log with multiple terms and overwriting of the WAL's tail. Reviewers: amitanand, sergei, rahuldesirazu Reviewed By: sergei Subscribers: sergei, bogdan Differential Revision: https://phabricator.dev.yugabyte.com/D8917
mbautin
added a commit
that referenced
this issue
Jul 23, 2020
…leader and replay of incorrect entries on tablet bootstrap Summary: This is a backport of e4e2d39 (D8917). Fixing various bugs in tablet bootstrap. #5003: WAL tail is not being overwritten correctly by a new leader ================================================================== We can get into a situation when part of the WAL with indexes higher than the last index flushed into RocksDB stays in pending_replicates by the end of tablet bootstrap. Appending a future entry will fail in this case, as happens in #5003. This bug happens because we don't process WAL tail overwrites caused by entries whose index is less than or equal to the last flushed index. The solution is to always process these overwrites, even if the newly added entry's index has no exact match in pending_replicates, and remove entries with higher indexes. Treating entries that were never committed as committed ======================================================= Part of tablet bootstrap logic that is supposed to only run on committed entries could sometimes run on entries that were never committed. This was possible because we still had an old way to recover the comitted OpId from using a Raft index-only match of flushed OpId to the entry being processed. This is a leftover from the time when we were setting RocksDB sequence numbers to Raft indexes. E.g. suppose entry 2.10 is flushed into RocksDB. When we encounter entry 1.10 during bootstrap, we would consider it committed, even though it might have been overwritten later by entry 2.5, because its index 10 matches that of the flushed OpId. We probably would not apply such potentially overwritten entries to RocksDB anyway as its index is too low and we have more checks later in the process. However, this outdated logic is not needed, because we have the "user frontier" / "consensus frontier" mechanism to keep track of last OpIds flushed in to RocksDBs. Not registering flushed operations with RetryableRequests ========================================================= RetryableRequests is a subsystem introduced in 023c20a for ignoring duplicate write requests to a tablet. It is important for consistency, and ensures that retries at a higher layer do not cause multiple writes. In tablet bootstrap, we need to register committed operations that happened during last FLAGS_retryable_request_timeout_secs seconds (120 seconds by default) with RetryableRequests, whether or not those operations are flushed. (Uncommitted operations are supposed to be passed to the consensus layer as "orphaned replicates" and be registered with RetryableRequests there.) Prior to this diff, operations with index lower than the flushed index were not register with RetryableRequests for transactional tables. [#3759] Potentially skipping too many segments with skip_wal_rewrite =================================================================== We were starting replay from the last segment where the committed OpId of the first entry was less than or equal to the flushed OpId.Suppose segment 1 has entries 1.1 to 1.10 and nothing is committed, and segment 2 has entries 1.11 and 1.20, and only entry 1.2 is committed. Flushed OpId is 1.5. Then we might look at segment 2, find that 1.2 (committed OpId) <= 1.5 (flushed OpId), and incorrectly skip segment 1, even though it contains entries we need. Instead of using committed OpId, we need to be selecting the first segment to replay as the last segment where the actual OpId of the first entry is less than or equal to the flushed OpId. [#4983] Only create a new log segment after successful bootstrap =============================================================== In the skip_wal_rewrite mode we are always creating a new log segment before starting log replay. Then log replay picks up the new segment, finds a "corruption" in it, and has to skip it. If a tablet server gets into a restart loop due to tablet bootstrap failure, hundreds of new empty log segments may get created quickly. It is much cleaner to only create a new log segment after successful tablet bootstrap in the skip_wal_rewrite mode (which is now the default mode). Also as part of doing this, I simplified the ReadFirstEntryMetadata function by delegating to the function ReadEntries with an additional parameter to just read one entry. Refactoring =========== Reorganizing the tablet bootstrap code to fully hide the TabletBootstrap class in the implementation file. Other improvements ================== Adding new test macros ASSERT_SETS_EQ and ASSERT_VECTORS_EQ. ASSERT_VECTORS_EQ reports differences between element sets of two vectors in case they do not match, in addition to regular output, making it easier to spot differences in many cases. Documenting intended usage for YB_STRUCT_TO_STRING and YB_CLASS_TO_STRING macros. Test Plan: Jenkins: urgent, patch: 2.1.8, auto rebase: no A new randomized tablet bootstrap test creating a log with multiple terms and overwriting of the WAL's tail. Reviewers: amitanand, rahuldesirazu, sergei Reviewed By: sergei Subscribers: bogdan, ybase Differential Revision: https://phabricator.dev.yugabyte.com/D8990
mbautin
added a commit
that referenced
this issue
Jul 23, 2020
…leader and replay of incorrect entries on tablet bootstrap Summary: This is a backport of e4e2d39 (D8917). Fixing various bugs in tablet bootstrap. #5003: WAL tail is not being overwritten correctly by a new leader ================================================================== We can get into a situation when part of the WAL with indexes higher than the last index flushed into RocksDB stays in pending_replicates by the end of tablet bootstrap. Appending a future entry will fail in this case, as happens in #5003. This bug happens because we don't process WAL tail overwrites caused by entries whose index is less than or equal to the last flushed index. The solution is to always process these overwrites, even if the newly added entry's index has no exact match in pending_replicates, and remove entries with higher indexes. Treating entries that were never committed as committed ======================================================= Part of tablet bootstrap logic that is supposed to only run on committed entries could sometimes run on entries that were never committed. This was possible because we still had an old way to recover the comitted OpId from using a Raft index-only match of flushed OpId to the entry being processed. This is a leftover from the time when we were setting RocksDB sequence numbers to Raft indexes. E.g. suppose entry 2.10 is flushed into RocksDB. When we encounter entry 1.10 during bootstrap, we would consider it committed, even though it might have been overwritten later by entry 2.5, because its index 10 matches that of the flushed OpId. We probably would not apply such potentially overwritten entries to RocksDB anyway as its index is too low and we have more checks later in the process. However, this outdated logic is not needed, because we have the "user frontier" / "consensus frontier" mechanism to keep track of last OpIds flushed in to RocksDBs. Not registering flushed operations with RetryableRequests ========================================================= RetryableRequests is a subsystem introduced in 023c20a for ignoring duplicate write requests to a tablet. It is important for consistency, and ensures that retries at a higher layer do not cause multiple writes. In tablet bootstrap, we need to register committed operations that happened during last FLAGS_retryable_request_timeout_secs seconds (120 seconds by default) with RetryableRequests, whether or not those operations are flushed. (Uncommitted operations are supposed to be passed to the consensus layer as "orphaned replicates" and be registered with RetryableRequests there.) Prior to this diff, operations with index lower than the flushed index were not register with RetryableRequests for transactional tables. [#3759] Potentially skipping too many segments with skip_wal_rewrite =================================================================== We were starting replay from the last segment where the committed OpId of the first entry was less than or equal to the flushed OpId.Suppose segment 1 has entries 1.1 to 1.10 and nothing is committed, and segment 2 has entries 1.11 and 1.20, and only entry 1.2 is committed. Flushed OpId is 1.5. Then we might look at segment 2, find that 1.2 (committed OpId) <= 1.5 (flushed OpId), and incorrectly skip segment 1, even though it contains entries we need. Instead of using committed OpId, we need to be selecting the first segment to replay as the last segment where the actual OpId of the first entry is less than or equal to the flushed OpId. [#4983] Only create a new log segment after successful bootstrap =============================================================== In the skip_wal_rewrite mode we are always creating a new log segment before starting log replay. Then log replay picks up the new segment, finds a "corruption" in it, and has to skip it. If a tablet server gets into a restart loop due to tablet bootstrap failure, hundreds of new empty log segments may get created quickly. It is much cleaner to only create a new log segment after successful tablet bootstrap in the skip_wal_rewrite mode (which is now the default mode). Also as part of doing this, I simplified the ReadFirstEntryMetadata function by delegating to the function ReadEntries with an additional parameter to just read one entry. Refactoring =========== Reorganizing the tablet bootstrap code to fully hide the TabletBootstrap class in the implementation file. Other improvements ================== Adding new test macros ASSERT_SETS_EQ and ASSERT_VECTORS_EQ. ASSERT_VECTORS_EQ reports differences between element sets of two vectors in case they do not match, in addition to regular output, making it easier to spot differences in many cases. Documenting intended usage for YB_STRUCT_TO_STRING and YB_CLASS_TO_STRING macros. Test Plan: Jenkins: patch: 2.2.0 A new randomized tablet bootstrap test creating a log with multiple terms and overwriting of the WAL's tail. Reviewers: amitanand, rahuldesirazu, sergei Reviewed By: sergei Subscribers: ybase, bogdan Differential Revision: https://phabricator.dev.yugabyte.com/D8988
bmatican
changed the title
During bootstrap skip flushed entries could skip log segment that contain committed operation
[docdb] During bootstrap skip flushed entries could skip log segment that contain committed operation
Oct 10, 2020
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
The could result in the following crash loop:
The text was updated successfully, but these errors were encountered: