[docdb] During bootstrap skip flushed entries could skip log segment that contain committed operation #3759

spolitov · 2020-02-27T10:04:57Z

The could result in the following crash loop:

[yugabyte@ip-192-168-238-53 logs]$ cat yb-tserver.FATAL.details.2020-02-27T08_38_02.pid100248.txt
F20200227 08:38:02 ../../src/yb/consensus/replica_state.cc:791] Check failed: pending_operations_.front()->id().index() == last_committed_op_id_.index + 1 (108780008 vs. 108776042) 
    @     0x7f1024ae1fd5  yb::LogFatalHandlerSink::send()
    @     0x7f10242d9466  google::LogMessage::SendToLog()
    @     0x7f10242d683a  google::LogMessage::Flush()
    @     0x7f10242d9a89  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f1026650a98  yb::consensus::ReplicaState::AdvanceCommittedOpIdUnlocked()
    @     0x7f102665107a  yb::consensus::ReplicaState::UpdateMajorityReplicatedUnlocked()
    @     0x7f102662aea6  yb::consensus::RaftConsensus::UpdateMajorityReplicated()
    @     0x7f102660ce4e  yb::consensus::PeerMessageQueue::NotifyObserversOfMajorityReplOpChangeTask()
    @     0x7f1024b70ea4  yb::ThreadPool::DispatchThread()
    @     0x7f1024b6d6b3  yb::Thread::SuperviseThread()
    @     0x7f1022c3afc4  start_thread
    @     0x7f1022b399bd  __clone
    @              (nil)  (unknown)

The text was updated successfully, but these errors were encountered:

bmatican · 2020-02-27T16:59:35Z

This looks like the same thing we hit AFTER the initial crash in #3413 (comment) cc @rahuldesirazu

… incorrect entries on tablet bootstrap Summary: Fixing various bugs in tablet bootstrap. #5003: WAL tail is not being overwritten correctly by a new leader ================================================================== We can get into a situation when part of the WAL with indexes higher than the last index flushed into RocksDB stays in pending_replicates by the end of tablet bootstrap. Appending a future entry will fail in this case, as happens in #5003. This bug happens because we don't process WAL tail overwrites caused by entries whose index is less than or equal to the last flushed index. The solution is to always process these overwrites, even if the newly added entry's index has no exact match in pending_replicates, and remove entries with higher indexes. Treating entries that were never committed as committed ======================================================= Ppart of tablet bootstrap logic that is supposed to only run on committed entries could sometimes run on entries that were never committed. This was possible because we still had an old way to recover the comitted OpId from using a Raft index-only match of flushed OpId to the entry being processed. This is a leftover from the time when we were setting RocksDB sequence numbers to Raft indexes. E.g. suppose entry 2.10 is flushed into RocksDB. When we encounter entry 1.10 during bootstrap, we would consider it committed, even though it might have been overwritten later by entry 2.5, because its index 10 matches that of the flushed OpId. We probably would not apply such potentially overwritten entries to RocksDB anyway as its index is too low and we have more checks later in the process. However, this outdated logic is not needed, because we have the "user frontier" / "consensus frontier" mechanism to keep track of last OpIds flushed in to RocksDBs. Not registering flushed operations with RetryableRequests ========================================================= RetryableRequests is a subsystem introduced in 023c20a for ignoring duplicate write requests to a tablet. It is important for consistency, and ensures that retries at a higher layer do not cause multiple writes. In tablet bootstrap, we need to register committed operations that happened during last FLAGS_retryable_request_timeout_secs seconds (120 seconds by default) with RetryableRequests, whether or not those operations are flushed. (Uncommitted operations are supposed to be passed to the consensus layer as "orphaned replicates" and be registered with RetryableRequests there.) Prior to this diff, operations with index lower than the flushed index were not register with RetryableRequests for transactional tables. #3759: potentially skipping too many segments with skip_wal_rewrite =================================================================== We were starting replay from the last segment where the committed OpId of the first entry was less than or equal to the flushed OpId.Suppose segment 1 has entries 1.1 to 1.10 and nothing is committed, and segment 2 has entries 1.11 and 1.20, and only entry 1.2 is committed. Flushed OpId is 1.5. Then we might look at segment 2, find that 1.2 (committed OpId) <= 1.5 (flushed OpId), and incorrectly skip segment 1, even though it contains entries we need. Instead of using committed OpId, we need to be selecting the first segment to replay as the last segment where the actual OpId of the first entry is less than or equal to the flushed OpId. #4983: only create a new log segment after successful bootstrap =============================================================== In the skip_wal_rewrite mode we are always creating a new log segment before starting log replay. Then log replay picks up the new segment, finds a "corruption" in it, and has to skip it. If a tablet server gets into a restart loop due to tablet bootstrap failure, hundreds of new empty log segments may get created quickly. It is much cleaner to only create a new log segment after successful tablet bootstrap in the skip_wal_rewrite mode (which is now the default mode). Also as part of doing this, I simplified the ReadFirstEntryMetadata function by delegating to the function ReadEntries with an additional parameter to just read one entry. Refactoring =========== Reorganizing the tablet bootstrap code to fully hide the TabletBootstrap class in the implementation file. Other improvements ================== Adding new test macros ASSERT_SETS_EQ and ASSERT_VECTORS_EQ. ASSERT_VECTORS_EQ reports differences between element sets of two vectors in case they do not match, in addition to regular output, making it easier to spot differences in many cases. Documenting intended usage for YB_STRUCT_TO_STRING and YB_CLASS_TO_STRING macros. Test Plan: Jenkins A new randomized tablet bootstrap test creating a log with multiple terms and overwriting of the WAL's tail. Reviewers: amitanand, sergei, rahuldesirazu Reviewed By: sergei Subscribers: sergei, bogdan Differential Revision: https://phabricator.dev.yugabyte.com/D8917

…leader and replay of incorrect entries on tablet bootstrap Summary: This is a backport of e4e2d39 (D8917). Fixing various bugs in tablet bootstrap. #5003: WAL tail is not being overwritten correctly by a new leader ================================================================== We can get into a situation when part of the WAL with indexes higher than the last index flushed into RocksDB stays in pending_replicates by the end of tablet bootstrap. Appending a future entry will fail in this case, as happens in #5003. This bug happens because we don't process WAL tail overwrites caused by entries whose index is less than or equal to the last flushed index. The solution is to always process these overwrites, even if the newly added entry's index has no exact match in pending_replicates, and remove entries with higher indexes. Treating entries that were never committed as committed ======================================================= Part of tablet bootstrap logic that is supposed to only run on committed entries could sometimes run on entries that were never committed. This was possible because we still had an old way to recover the comitted OpId from using a Raft index-only match of flushed OpId to the entry being processed. This is a leftover from the time when we were setting RocksDB sequence numbers to Raft indexes. E.g. suppose entry 2.10 is flushed into RocksDB. When we encounter entry 1.10 during bootstrap, we would consider it committed, even though it might have been overwritten later by entry 2.5, because its index 10 matches that of the flushed OpId. We probably would not apply such potentially overwritten entries to RocksDB anyway as its index is too low and we have more checks later in the process. However, this outdated logic is not needed, because we have the "user frontier" / "consensus frontier" mechanism to keep track of last OpIds flushed in to RocksDBs. Not registering flushed operations with RetryableRequests ========================================================= RetryableRequests is a subsystem introduced in 023c20a for ignoring duplicate write requests to a tablet. It is important for consistency, and ensures that retries at a higher layer do not cause multiple writes. In tablet bootstrap, we need to register committed operations that happened during last FLAGS_retryable_request_timeout_secs seconds (120 seconds by default) with RetryableRequests, whether or not those operations are flushed. (Uncommitted operations are supposed to be passed to the consensus layer as "orphaned replicates" and be registered with RetryableRequests there.) Prior to this diff, operations with index lower than the flushed index were not register with RetryableRequests for transactional tables. [#3759] Potentially skipping too many segments with skip_wal_rewrite =================================================================== We were starting replay from the last segment where the committed OpId of the first entry was less than or equal to the flushed OpId.Suppose segment 1 has entries 1.1 to 1.10 and nothing is committed, and segment 2 has entries 1.11 and 1.20, and only entry 1.2 is committed. Flushed OpId is 1.5. Then we might look at segment 2, find that 1.2 (committed OpId) <= 1.5 (flushed OpId), and incorrectly skip segment 1, even though it contains entries we need. Instead of using committed OpId, we need to be selecting the first segment to replay as the last segment where the actual OpId of the first entry is less than or equal to the flushed OpId. [#4983] Only create a new log segment after successful bootstrap =============================================================== In the skip_wal_rewrite mode we are always creating a new log segment before starting log replay. Then log replay picks up the new segment, finds a "corruption" in it, and has to skip it. If a tablet server gets into a restart loop due to tablet bootstrap failure, hundreds of new empty log segments may get created quickly. It is much cleaner to only create a new log segment after successful tablet bootstrap in the skip_wal_rewrite mode (which is now the default mode). Also as part of doing this, I simplified the ReadFirstEntryMetadata function by delegating to the function ReadEntries with an additional parameter to just read one entry. Refactoring =========== Reorganizing the tablet bootstrap code to fully hide the TabletBootstrap class in the implementation file. Other improvements ================== Adding new test macros ASSERT_SETS_EQ and ASSERT_VECTORS_EQ. ASSERT_VECTORS_EQ reports differences between element sets of two vectors in case they do not match, in addition to regular output, making it easier to spot differences in many cases. Documenting intended usage for YB_STRUCT_TO_STRING and YB_CLASS_TO_STRING macros. Test Plan: Jenkins: urgent, patch: 2.1.8, auto rebase: no A new randomized tablet bootstrap test creating a log with multiple terms and overwriting of the WAL's tail. Reviewers: amitanand, rahuldesirazu, sergei Reviewed By: sergei Subscribers: bogdan, ybase Differential Revision: https://phabricator.dev.yugabyte.com/D8990

…leader and replay of incorrect entries on tablet bootstrap Summary: This is a backport of e4e2d39 (D8917). Fixing various bugs in tablet bootstrap. #5003: WAL tail is not being overwritten correctly by a new leader ================================================================== We can get into a situation when part of the WAL with indexes higher than the last index flushed into RocksDB stays in pending_replicates by the end of tablet bootstrap. Appending a future entry will fail in this case, as happens in #5003. This bug happens because we don't process WAL tail overwrites caused by entries whose index is less than or equal to the last flushed index. The solution is to always process these overwrites, even if the newly added entry's index has no exact match in pending_replicates, and remove entries with higher indexes. Treating entries that were never committed as committed ======================================================= Part of tablet bootstrap logic that is supposed to only run on committed entries could sometimes run on entries that were never committed. This was possible because we still had an old way to recover the comitted OpId from using a Raft index-only match of flushed OpId to the entry being processed. This is a leftover from the time when we were setting RocksDB sequence numbers to Raft indexes. E.g. suppose entry 2.10 is flushed into RocksDB. When we encounter entry 1.10 during bootstrap, we would consider it committed, even though it might have been overwritten later by entry 2.5, because its index 10 matches that of the flushed OpId. We probably would not apply such potentially overwritten entries to RocksDB anyway as its index is too low and we have more checks later in the process. However, this outdated logic is not needed, because we have the "user frontier" / "consensus frontier" mechanism to keep track of last OpIds flushed in to RocksDBs. Not registering flushed operations with RetryableRequests ========================================================= RetryableRequests is a subsystem introduced in 023c20a for ignoring duplicate write requests to a tablet. It is important for consistency, and ensures that retries at a higher layer do not cause multiple writes. In tablet bootstrap, we need to register committed operations that happened during last FLAGS_retryable_request_timeout_secs seconds (120 seconds by default) with RetryableRequests, whether or not those operations are flushed. (Uncommitted operations are supposed to be passed to the consensus layer as "orphaned replicates" and be registered with RetryableRequests there.) Prior to this diff, operations with index lower than the flushed index were not register with RetryableRequests for transactional tables. [#3759] Potentially skipping too many segments with skip_wal_rewrite =================================================================== We were starting replay from the last segment where the committed OpId of the first entry was less than or equal to the flushed OpId.Suppose segment 1 has entries 1.1 to 1.10 and nothing is committed, and segment 2 has entries 1.11 and 1.20, and only entry 1.2 is committed. Flushed OpId is 1.5. Then we might look at segment 2, find that 1.2 (committed OpId) <= 1.5 (flushed OpId), and incorrectly skip segment 1, even though it contains entries we need. Instead of using committed OpId, we need to be selecting the first segment to replay as the last segment where the actual OpId of the first entry is less than or equal to the flushed OpId. [#4983] Only create a new log segment after successful bootstrap =============================================================== In the skip_wal_rewrite mode we are always creating a new log segment before starting log replay. Then log replay picks up the new segment, finds a "corruption" in it, and has to skip it. If a tablet server gets into a restart loop due to tablet bootstrap failure, hundreds of new empty log segments may get created quickly. It is much cleaner to only create a new log segment after successful tablet bootstrap in the skip_wal_rewrite mode (which is now the default mode). Also as part of doing this, I simplified the ReadFirstEntryMetadata function by delegating to the function ReadEntries with an additional parameter to just read one entry. Refactoring =========== Reorganizing the tablet bootstrap code to fully hide the TabletBootstrap class in the implementation file. Other improvements ================== Adding new test macros ASSERT_SETS_EQ and ASSERT_VECTORS_EQ. ASSERT_VECTORS_EQ reports differences between element sets of two vectors in case they do not match, in addition to regular output, making it easier to spot differences in many cases. Documenting intended usage for YB_STRUCT_TO_STRING and YB_CLASS_TO_STRING macros. Test Plan: Jenkins: patch: 2.2.0 A new randomized tablet bootstrap test creating a log with multiple terms and overwriting of the WAL's tail. Reviewers: amitanand, rahuldesirazu, sergei Reviewed By: sergei Subscribers: ybase, bogdan Differential Revision: https://phabricator.dev.yugabyte.com/D8988

spolitov added the area/docdb YugabyteDB core features label Feb 27, 2020

rao-vasireddy assigned bmatican Feb 27, 2020

rao-vasireddy added kind/bug This issue is a bug priority/high High Priority labels Feb 27, 2020

bmatican assigned rahuldesirazu and unassigned bmatican Feb 28, 2020

bmatican changed the title ~~During bootstrap skip flushed entries could skip log segment that contain committed operation~~ [docdb] During bootstrap skip flushed entries could skip log segment that contain committed operation Oct 10, 2020

bmatican assigned mbautin and unassigned rahuldesirazu Oct 10, 2020

bmatican closed this as completed Oct 10, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[docdb] During bootstrap skip flushed entries could skip log segment that contain committed operation #3759

[docdb] During bootstrap skip flushed entries could skip log segment that contain committed operation #3759

spolitov commented Feb 27, 2020 •

edited by daniel-yb

Loading

bmatican commented Feb 27, 2020

[docdb] During bootstrap skip flushed entries could skip log segment that contain committed operation #3759

[docdb] During bootstrap skip flushed entries could skip log segment that contain committed operation #3759

Comments

spolitov commented Feb 27, 2020 • edited by daniel-yb Loading

bmatican commented Feb 27, 2020

spolitov commented Feb 27, 2020 •

edited by daniel-yb

Loading