Hybrid time too low in RaftConsensus::Start #5003
Timestamp of the last replicated record: { physical: 1594065993286094 }
New operation's hybrid time: { physical: 1593561737762828 }
In RaftConsensus::Start we have this code:

{
ReplicaState::UniqueLock lock;
RETURN_NOT_OK(state_->LockForStart(&lock));
state_->ClearLeaderUnlocked();
RETURN_NOT_OK_PREPEND(state_->StartUnlocked(info.last_id),
"Unable to start RAFT ReplicaState");
LOG_WITH_PREFIX(INFO) << "Replica starting. Triggering "
<< info.orphaned_replicates.size()
<< " pending operations. Active config: "
<< state_->GetActiveConfigUnlocked().ShortDebugString();
for (const auto& replicate : info.orphaned_replicates) {
ReplicateMsgPtr replicate_ptr = std::make_shared<ReplicateMsg>(*replicate);
RETURN_NOT_OK(StartReplicaOperationUnlocked(replicate_ptr, HybridTime::kInvalid));
}
RETURN_NOT_OK(state_->InitCommittedOpIdUnlocked(yb::OpId::FromPB(info.last_committed_id)));
queue_->Init(state_->GetLastReceivedOpIdUnlocked().ToPB<OpId>());
}

(see the call to StartReplicaOperationUnlocked).
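For context, the fatal below comes from MvccManager::AddPending (see the stack trace at the end of this issue): MVCC requires each new operation's hybrid time to be newer than the hybrid time of the last replicated record, and replaying the stale orphaned replicate violates that. Here is a minimal standalone sketch of the failing comparison using the physical timestamps from this issue (illustrative names, not the actual MvccManager code):

#include <cstdint>
#include <iostream>

// Illustrative stand-in for the MVCC watermark that matters here.
struct SafeTimeWatermarks {
  uint64_t last_replicated;  // hybrid time of the last replicated record
};

bool CanAddPending(const SafeTimeWatermarks& watermarks, uint64_t new_op_ht) {
  // Mirrors the "*ht < last_replicated_" comparison visible in the fatal log below.
  if (new_op_ht < watermarks.last_replicated) {
    std::cerr << "New operation's hybrid time too low\n";
    return false;  // The real code LOG(FATAL)s at mvcc.cc:398.
  }
  return true;
}

int main() {
  // Physical timestamps (microseconds) from this issue, before the 12-bit
  // logical shift discussed further down in this thread.
  SafeTimeWatermarks watermarks{1594065993286094ULL};
  CanAddPending(watermarks, 1593561737762828ULL);  // prints the error
}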
YugabyteDB version:
Looking at the WAL dumps: wal-000000007 and wal-000000008 are the files containing data written before the crash, while wal-000000009 and beyond were generated by repeated restarts. Trying to locate the new operation in the WAL yields the following:
A slightly different way to look at the log (skipping the application data):
The problematic entry with the entries before and after:
A way to look at only op ids and committed op ids:
It is clear that the committed op id gets stuck at 8.2107 for a long time:
Eventually, term 9's leader replicates 9.2103, apparently overwriting entries 8.2108 to 8.2513.
When looking at the next log file (
I see the last two entries
We have code in tablet bootstrap that is supposed to erase from pending_replicates any entries that have been overwritten by entries from later terms.
When processing entry 9.2108 (the NO OP entry replicated by term 9's leader), it should have erased the older entries with indexes 2108 and higher (8.2108 through 8.2513 inclusive), which would also erase operation 8.2400. That is clearly not happening, because that operation causes the crash later.
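For illustration, here is a minimal standalone sketch of the intended erase behavior (illustrative types, not the actual tablet bootstrap code): when a later-term entry arrives at an index that is already pending with an older term, that entry and everything after it should be dropped from pending_replicates before the new entry is added.

#include <cstdint>
#include <iostream>
#include <iterator>
#include <map>

struct OpId {
  int64_t term;
  int64_t index;
};

// pending_replicates keyed by Raft index.
using PendingReplicates = std::map<int64_t, OpId>;

void HandlePossibleOverwrite(PendingReplicates* pending, const OpId& new_entry) {
  auto it = pending->lower_bound(new_entry.index);
  if (it != pending->end() && it->second.term < new_entry.term) {
    // Everything from this index onward was overwritten by the new leader.
    std::cout << "Erasing " << std::distance(it, pending->end())
              << " overwritten entries starting at index " << it->first << "\n";
    pending->erase(it, pending->end());
  }
  (*pending)[new_entry.index] = new_entry;
}

int main() {
  PendingReplicates pending;
  for (int64_t i = 2108; i <= 2513; ++i) pending[i] = OpId{8, i};  // 8.2108..8.2513
  HandlePossibleOverwrite(&pending, OpId{9, 2108});  // term 9 leader's NO OP
  // 8.2400 is gone along with the rest of the stale tail; only 9.2108 remains.
  std::cout << "pending size: " << pending.size() << "\n";
}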
Attaching full output of
And I will also mention, just in case, that the last replicated hybrid time is 1594065993286094 * 4096 = 6529294308499841024 (we use 12 bits for the logical time component), which corresponds to the last entry from
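As a quick sanity check of that arithmetic (physical microseconds shifted left by 12 bits, logical component in the low 12 bits), the raw values from the fatal log below decode back to the physical timestamps discussed here. This is a standalone check, not YugabyteDB code:

#include <cstdint>

int main() {
  // Last replicated hybrid time: 1594065993286094 us shifted by 12 bits.
  constexpr uint64_t kLastReplicatedRaw = 1594065993286094ULL << 12;
  static_assert(kLastReplicatedRaw == 6529294308499841024ULL,
                "matches the value quoted in this comment");

  // Raw hybrid time of the new (stale) operation: the fatal log's diff against
  // a zero safe time equals the raw value itself.
  constexpr uint64_t kStaleOpRaw = 6527228877876543488ULL;
  static_assert((kStaleOpRaw >> 12) == 1593561737762828ULL,
                "physical part of the new operation's hybrid time");

  // The stale operation is ~504,256 seconds (~5.8 days) older than the last
  // replicated record, which is why the "hybrid time too low" check fires.
  constexpr int64_t kPhysicalDiffMicros =
      static_cast<int64_t>(kStaleOpRaw >> 12) - 1594065993286094LL;
  static_assert(kPhysicalDiffMicros == -504255523266LL,
                "matches PhysicalDiff in the fatal log");
  return 0;
}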
Analyzing the data from the tablet server log. Flushed OpIds are 17.2399 for both the regular and intents RocksDBs:

I0708 14:37:55.957229 8517 version_set.cc:2818] T 91bf626f5794471ba61db14f07aaca8f P 621efb7a1e9544abbccb945b6b13d9be [R]: Recovered from manifest file:/mnt/disk1/yb-data/tserver/data/rocksdb/table-e7bd70eb6f3f4051b90c9effb2f7352c/tablet-91bf626f5794471ba61db14f07aaca8f/MANIFEST-001767 succeeded,manifest_file_number is 1767, next_file_number is 1769, last_sequence is 1125899910856910, log_number is 0,prev_log_number is 0,max_column_family is 0, flushed_values is 0x000000002e9bf350 -> { op_id: 17.2399 hybrid_time: { physical: 1594065993286094 } history_cutoff: { physical: 1594065867642616 } hybrid_time_filter: }
So the reason we do not get to the code that removes records 8.2108 up to and including 8.2513 from pending_replicates when 9.2109 is encountered (yugabyte-db/src/yb/tablet/tablet_bootstrap.cc, line 705 at commit 7724903) is the early return below: 9.2109's index (2109) is less than or equal to the minimum flushed index (2399), so we return before the overwrite handling ever runs:
auto min_index = std::min(
replay_state_->regular_stored_op_id.index(), replay_state_->intents_stored_op_id.index());
if (op_id.index() <= min_index) {
// Do not update the bootstrap in-memory state for log records that have already been applied to
// RocksDB, or were overwritten by a later entry with a higher term that has already been
// applied to both regular and provisional record RocksDB.
replicate_entry_ptr->reset();
return Status::OK();
}
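To make the control flow explicit, here is a condensed sketch of why the erase never runs for 9.2109 and how the fix described in the commits below reorders the two steps (illustrative names, not the actual tablet_bootstrap.cc code):

#include <algorithm>
#include <cstdint>

struct ReplayState {
  int64_t regular_stored_index;  // flushed OpId index of the regular RocksDB
  int64_t intents_stored_index;  // flushed OpId index of the intents RocksDB
};

// Stand-in for the pending_replicates cleanup sketched earlier in the thread.
void HandleTailOverwrite(int64_t term, int64_t index) {
  (void)term;
  (void)index;
}

// Buggy order: the flushed-index early return runs first, so for 9.2109
// (index 2109 <= min flushed index 2399) the overwrite step is never reached
// and 8.2108..8.2513 stay in pending_replicates.
bool ReplayEntryBuggy(const ReplayState& state, int64_t term, int64_t index) {
  const int64_t min_index =
      std::min(state.regular_stored_index, state.intents_stored_index);
  if (index <= min_index) {
    return false;  // record skipped; the stale tail is never cleaned up
  }
  HandleTailOverwrite(term, index);
  return true;
}

// Fixed order: always process a possible tail overwrite first, then decide
// whether the record itself can be skipped because it is already flushed.
bool ReplayEntryFixed(const ReplayState& state, int64_t term, int64_t index) {
  HandleTailOverwrite(term, index);
  const int64_t min_index =
      std::min(state.regular_stored_index, state.intents_stored_index);
  return index > min_index;
}

int main() {
  ReplayState state{2399, 2399};
  // Both orders skip replaying 9.2109 itself, but only the fixed order has
  // already erased the overwritten tail by the time it returns.
  return ReplayEntryBuggy(state, 9, 2109) || ReplayEntryFixed(state, 9, 2109) ? 1 : 0;
}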
Summary: Enhance log-dump to provide a way to filter a WAL segment and remove a range of OpIds. This is needed for recovering clusters affected by #5003.

Also adding a new flag --initial_log_segment_size_bytes for configuring the size of the first log segment (1 MB by default). Previously, the default value of the initial log segment size was also 1 MB, and it was achieved by hard-coding 512 KB as the default value of the current segment size and doubling it on every segment allocation, up to the maximum size. Now we set the current segment size to half of FLAGS_initial_log_segment_size_bytes and then double it at the very first log segment allocation to achieve the same effect, but with cleaner semantics for the user of the Log class.

Test Plan:

Filter the log from #5003:

log-dump --filter_log_segment \
  ~/log-dump-test/input/wal-000000007 \
  --output_wal_dir ~/log-dump-test/output \
  --min_op_term_to_omit 8 --min_op_index_to_omit 2108 \
  --max_op_term_to_omit 8 --max_op_index_to_omit 2513

It should produce output similar to:

I0710 18:24:57.233637 13345 log-dump.cc:249] Created directory ~/log-dump-test/output
I0710 18:24:57.234838 13345 log-dump.cc:269] Source segment size ~/log-dump-test/input/wal-000000007: 67389360 bytes
I0710 18:24:57.234859 13345 log-dump.cc:271] Target segment size: 74448896 bytes
I0710 18:24:57.234988 13345 log-dump.cc:281] Will omit records between OpIds { term: 8 index: 2108 } and { term: 8 index: 2513 } (including the exact OpId matches).
I0710 18:24:57.235208 13345 log.cc:819] T 91bf626f5794471ba61db14f07aaca8f P log-dump-tool: Setting log wal retention time to 0 seconds
I0710 18:24:57.235327 13345 log.cc:429] interval_durable_wal_write_ms is turned on to sync every 1000 ms.
I0710 18:24:58.092380 13345 log-dump.cc:319] Included 881 entries, omitted 406 entries
I0710 18:24:58.330152 13345 log-dump.cc:326] Generated file ~/log-dump-test/output/index.000000000
I0710 18:24:58.330181 13345 log-dump.cc:326] Generated file ~/log-dump-test/output/wal-000000007

Look at the size of the new file. It should be 45974166 or something close.

Then compare the output of log-dump on the old and new files:

diff <(./log-dump ~/log-dump-test/input/wal-000000007 | grep -v 'Tablet: 91bf626f5794471ba61db14f07aaca8f') \
     <(./log-dump ~/log-dump-test/output/wal-000000007 | grep -v 'Tablet: 91bf626f5794471ba61db14f07aaca8f')

The result should be similar to https://gist.githubusercontent.com/mbautin/5eccd75c5a4935ba6a760978fa76cf0b/raw

Reviewers: hector, kannan, rahuldesirazu

Reviewed By: rahuldesirazu

Subscribers: ybase, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D8876
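The segment-size change in this commit is easiest to see as a small growth function: the current size starts at half of the configured initial size and doubles on each allocation, capped at the maximum. A rough sketch under those assumptions (illustrative class and values, not the actual Log code; the 64 MB maximum here is just an example):

#include <algorithm>
#include <cstdint>
#include <iostream>

// The first call returns the configured initial size (half of it, doubled at
// the very first allocation), then 2x, 4x, ... up to the maximum.
class SegmentSizer {
 public:
  SegmentSizer(uint64_t initial_bytes, uint64_t max_bytes)
      : cur_size_(initial_bytes / 2), max_size_(max_bytes) {}

  // Called when a new segment is allocated; returns the size to use.
  uint64_t NextSegmentSize() {
    cur_size_ = std::min(cur_size_ * 2, max_size_);
    return cur_size_;
  }

 private:
  uint64_t cur_size_;
  uint64_t max_size_;
};

int main() {
  SegmentSizer sizer(1 * 1024 * 1024, 64 * 1024 * 1024);
  for (int i = 0; i < 8; ++i) {
    std::cout << "segment " << i << ": " << sizer.NextSegmentSize() << " bytes\n";
  }
}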
… incorrect entries on tablet bootstrap

Summary: Fixing various bugs in tablet bootstrap.

#5003: WAL tail is not being overwritten correctly by a new leader
==================================================================

We can get into a situation where part of the WAL with indexes higher than the last index flushed into RocksDB stays in pending_replicates by the end of tablet bootstrap. Appending a future entry will fail in this case, as happens in #5003. This bug happens because we don't process WAL tail overwrites caused by entries whose index is less than or equal to the last flushed index. The solution is to always process these overwrites, even if the newly added entry's index has no exact match in pending_replicates, and remove entries with higher indexes.

Treating entries that were never committed as committed
=======================================================

Part of the tablet bootstrap logic that is supposed to only run on committed entries could sometimes run on entries that were never committed. This was possible because we still had an old way to recover the committed OpId by using a Raft index-only match of the flushed OpId to the entry being processed. This is a leftover from the time when we were setting RocksDB sequence numbers to Raft indexes. E.g. suppose entry 2.10 is flushed into RocksDB. When we encounter entry 1.10 during bootstrap, we would consider it committed, even though it might have been overwritten later by entry 2.5, because its index 10 matches that of the flushed OpId. We probably would not apply such potentially overwritten entries to RocksDB anyway, as their index is too low and we have more checks later in the process. However, this outdated logic is not needed, because we have the "user frontier" / "consensus frontier" mechanism to keep track of the last OpIds flushed into the RocksDBs.

Not registering flushed operations with RetryableRequests
=========================================================

RetryableRequests is a subsystem introduced in 023c20a for ignoring duplicate write requests to a tablet. It is important for consistency and ensures that retries at a higher layer do not cause multiple writes. In tablet bootstrap, we need to register committed operations that happened during the last FLAGS_retryable_request_timeout_secs seconds (120 seconds by default) with RetryableRequests, whether or not those operations are flushed. (Uncommitted operations are supposed to be passed to the consensus layer as "orphaned replicates" and be registered with RetryableRequests there.) Prior to this diff, operations with an index lower than the flushed index were not registered with RetryableRequests for transactional tables.

#3759: Potentially skipping too many segments with skip_wal_rewrite
===================================================================

We were starting replay from the last segment where the committed OpId of the first entry was less than or equal to the flushed OpId. Suppose segment 1 has entries 1.1 to 1.10 and nothing is committed, and segment 2 has entries 1.11 to 1.20, and only entry 1.2 is committed. The flushed OpId is 1.5. Then we might look at segment 2, find that 1.2 (committed OpId) <= 1.5 (flushed OpId), and incorrectly skip segment 1, even though it contains entries we need. Instead of using the committed OpId, we need to select the first segment to replay as the last segment where the actual OpId of the first entry is less than or equal to the flushed OpId.
#4983: Only create a new log segment after successful bootstrap
===============================================================

In the skip_wal_rewrite mode we are always creating a new log segment before starting log replay. Then log replay picks up the new segment, finds a "corruption" in it, and has to skip it. If a tablet server gets into a restart loop due to tablet bootstrap failure, hundreds of new empty log segments may get created quickly. It is much cleaner to only create a new log segment after successful tablet bootstrap in the skip_wal_rewrite mode (which is now the default mode). Also, as part of doing this, I simplified the ReadFirstEntryMetadata function by delegating to the ReadEntries function with an additional parameter to just read one entry.

Refactoring
===========

Reorganizing the tablet bootstrap code to fully hide the TabletBootstrap class in the implementation file.

Other improvements
==================

Adding new test macros ASSERT_SETS_EQ and ASSERT_VECTORS_EQ. ASSERT_VECTORS_EQ reports differences between the element sets of two vectors in case they do not match, in addition to the regular output, making it easier to spot differences in many cases. Documenting the intended usage of the YB_STRUCT_TO_STRING and YB_CLASS_TO_STRING macros.

Test Plan: Jenkins

A new randomized tablet bootstrap test creating a log with multiple terms and overwriting of the WAL's tail.

Reviewers: amitanand, sergei, rahuldesirazu

Reviewed By: sergei

Subscribers: sergei, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D8917
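The #3759 fix above amounts to changing which OpId is compared against the flushed OpId when picking the first segment to replay. A small sketch of the corrected rule (illustrative types, not the bootstrap code):

#include <cstdint>
#include <vector>

struct OpId {
  int64_t term;
  int64_t index;
  bool operator<=(const OpId& other) const {
    return term < other.term || (term == other.term && index <= other.index);
  }
};

// Replay from the LAST segment whose FIRST entry's actual OpId is <= the
// flushed OpId, instead of comparing against that entry's committed OpId.
size_t FirstSegmentToReplay(const std::vector<OpId>& first_op_id_per_segment,
                            const OpId& flushed_op_id) {
  size_t result = 0;
  for (size_t i = 0; i < first_op_id_per_segment.size(); ++i) {
    if (first_op_id_per_segment[i] <= flushed_op_id) {
      result = i;
    }
  }
  return result;
}

int main() {
  // Example from the commit message: segment 0 starts at 1.1, segment 1 at
  // 1.11; the flushed OpId is 1.5, so replay must start at segment 0.
  std::vector<OpId> firsts = {{1, 1}, {1, 11}};
  return static_cast<int>(FirstSegmentToReplay(firsts, {1, 5}));  // returns 0
}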
…leader and replay of incorrect entries on tablet bootstrap

Summary: This is a backport of e4e2d39 (D8917): fixing various bugs in tablet bootstrap (same commit message as D8917 above).

Test Plan: Jenkins: urgent, patch: 2.1.8, auto rebase: no

A new randomized tablet bootstrap test creating a log with multiple terms and overwriting of the WAL's tail.

Reviewers: amitanand, rahuldesirazu, sergei

Reviewed By: sergei

Subscribers: bogdan, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D8990
…leader and replay of incorrect entries on tablet bootstrap

Summary: This is a backport of e4e2d39 (D8917): fixing various bugs in tablet bootstrap (same commit message as D8917 above).

Test Plan: Jenkins: patch: 2.2.0

A new randomized tablet bootstrap test creating a log with multiple terms and overwriting of the WAL's tail.

Reviewers: amitanand, rahuldesirazu, sergei

Reviewed By: sergei

Subscribers: ybase, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D8988
F20200707 23:25:17 ../../src/yb/tablet/mvcc.cc:398] T 91bf626f5794471ba61db14f07aaca8f P 621efb7a1e9544abbccb945b6b13d9be: T 91bf626f5794471ba61db14f07aaca8f P 621efb7a1e9544abbccb945b6b13d9be: Recent 2 MVCC operations:
New operation's hybrid time too low: { physical: 1593561737762828 }
max_safe_time_returned_with_lease_={ safe_time: source: kUnknown }
*ht < max_safe_time_returned_with_lease_.safe_time=0
static_cast<int64_t>(ht->ToUint64() - max_safe_time_returned_with_lease_.safe_time.ToUint64())=6527228877876543488
ht->PhysicalDiff(max_safe_time_returned_with_lease_.safe_time)=1593561737762828
max_safe_time_returned_without_lease_={ safe_time: source: kUnknown }
*ht < max_safe_time_returned_without_lease_.safe_time=0
static_cast<int64_t>(ht->ToUint64() - max_safe_time_returned_without_lease_.safe_time.ToUint64())=6527228877876543488
ht->PhysicalDiff(max_safe_time_returned_without_lease_.safe_time)=1593561737762828
max_safe_time_returned_for_follower_={ safe_time: source: kUnknown }
*ht < max_safe_time_returned_for_follower_.safe_time=0
static_cast<int64_t>(ht->ToUint64() - max_safe_time_returned_for_follower_.safe_time.ToUint64())=6527228877876543488
ht->PhysicalDiff(max_safe_time_returned_for_follower_.safe_time)=1593561737762828
(SafeTimeWithSource{last_replicated_, SafeTimeSource::kUnknown})={ safe_time: { physical: 1594065993286094 } source: kUnknown }
*ht < (SafeTimeWithSource{last_replicated_, SafeTimeSource::kUnknown}).safe_time=1
static_cast<int64_t>(ht->ToUint64() - (SafeTimeWithSource{last_replicated_, SafeTimeSource::kUnknown}).safe_time.ToUint64())=-2065430623297536
ht->PhysicalDiff((SafeTimeWithSource{last_replicated_, SafeTimeSource::kUnknown}).safe_time)=-504255523266
(SafeTimeWithSource{last_ht_in_queue, SafeTimeSource::kUnknown})={ safe_time: source: kUnknown }
*ht < (SafeTimeWithSource{last_ht_in_queue, SafeTimeSource::kUnknown}).safe_time=0
static_cast<int64_t>(ht->ToUint64() - (SafeTimeWithSource{last_ht_in_queue, SafeTimeSource::kUnknown}).safe_time.ToUint64())=6527228877876543488
ht->PhysicalDiff((SafeTimeWithSource{last_ht_in_queue, SafeTimeSource::kUnknown}).safe_time)=1593561737762828
is_follower_side=1
queue_.size()=0
queue_=[]
aborted=[]
@ 0x7f91cd7b9c0c yb::LogFatalHandlerSink::send()
@ 0x7f91cc9a1346 google::LogMessage::SendToLog()
@ 0x7f91cc99e7aa google::LogMessage::Flush()
@ 0x7f91cc9a1879 google::LogMessageFatal::~LogMessageFatal()
@ 0x7f91d5e348fe yb::tablet::MvccManager::AddPending()
@ 0x7f91d5ddfdfb yb::tablet::TabletPeer::StartReplicaOperation()
@ 0x7f91d5a576f1 yb::consensus::RaftConsensus::StartReplicaOperationUnlocked()
@ 0x7f91d5a6098f yb::consensus::RaftConsensus::Start()
@ 0x7f91d5de1a83 yb::tablet::TabletPeer::Start()
@ 0x7f91d66e1ba8 yb::tserver::TSTabletManager::OpenTablet()
@ 0x7f91cd84c4a4 yb::ThreadPool::DispatchThread()
@ 0x7f91cd848e2f yb::Thread::SuperviseThread()
@ 0x7f91c807f694 start_thread
@ 0x7f91c77bc41d __clone
@ (nil) (unknown)
Aha! Link: https://yugabyte-test.aha.io/features/PLATFORM-181