Hybrid time too low in RaftConsensus::Start #5003

Closed
mbautin opened this issue Jul 8, 2020 · 10 comments

mbautin commented Jul 8, 2020

F20200707 23:25:17 ../../src/yb/tablet/mvcc.cc:398] T 91bf626f5794471ba61db14f07aaca8f P 621efb7a1e9544abbccb945b6b13d9be: T 91bf626f5794471ba61db14f07aaca8f P 621efb7a1e9544abbccb945b6b13d9be: Recent 2 MVCC operations:

  1. SetLastReplicated { ht: { physical: 1594065993286094 } }
  2. SetLeaderOnlyMode { leader_only: 0 }

New operation's hybrid time too low: { physical: 1593561737762828 }
max_safe_time_returned_with_lease_={ safe_time: source: kUnknown }
*ht < max_safe_time_returned_with_lease_.safe_time=0
static_cast<int64_t>(ht->ToUint64() - max_safe_time_returned_with_lease_.safe_time.ToUint64())=6527228877876543488
ht->PhysicalDiff(max_safe_time_returned_with_lease_.safe_time)=1593561737762828

max_safe_time_returned_without_lease_={ safe_time: source: kUnknown }
*ht < max_safe_time_returned_without_lease_.safe_time=0
static_cast<int64_t>(ht->ToUint64() - max_safe_time_returned_without_lease_.safe_time.ToUint64())=6527228877876543488
ht->PhysicalDiff(max_safe_time_returned_without_lease_.safe_time)=1593561737762828

max_safe_time_returned_for_follower_={ safe_time: source: kUnknown }
*ht < max_safe_time_returned_for_follower_.safe_time=0
static_cast<int64_t>(ht->ToUint64() - max_safe_time_returned_for_follower_.safe_time.ToUint64())=6527228877876543488
ht->PhysicalDiff(max_safe_time_returned_for_follower_.safe_time)=1593561737762828

(SafeTimeWithSource{last_replicated_, SafeTimeSource::kUnknown})={ safe_time: { physical: 1594065993286094 } source: kUnknown }
*ht < (SafeTimeWithSource{last_replicated_, SafeTimeSource::kUnknown}).safe_time=1
static_cast<int64_t>(ht->ToUint64() - (SafeTimeWithSource{last_replicated_, SafeTimeSource::kUnknown}).safe_time.ToUint64())=-2065430623297536
ht->PhysicalDiff((SafeTimeWithSource{last_replicated_, SafeTimeSource::kUnknown}).safe_time)=-504255523266

(SafeTimeWithSource{last_ht_in_queue, SafeTimeSource::kUnknown})={ safe_time: source: kUnknown }
*ht < (SafeTimeWithSource{last_ht_in_queue, SafeTimeSource::kUnknown}).safe_time=0
static_cast<int64_t>(ht->ToUint64() - (SafeTimeWithSource{last_ht_in_queue, SafeTimeSource::kUnknown}).safe_time.ToUint64())=6527228877876543488
ht->PhysicalDiff((SafeTimeWithSource{last_ht_in_queue, SafeTimeSource::kUnknown}).safe_time)=1593561737762828

is_follower_side=1
queue_.size()=0
queue_=[]
aborted=[]
@ 0x7f91cd7b9c0c yb::LogFatalHandlerSink::send()
@ 0x7f91cc9a1346 google::LogMessage::SendToLog()
@ 0x7f91cc99e7aa google::LogMessage::Flush()
@ 0x7f91cc9a1879 google::LogMessageFatal::~LogMessageFatal()
@ 0x7f91d5e348fe yb::tablet::MvccManager::AddPending()
@ 0x7f91d5ddfdfb yb::tablet::TabletPeer::StartReplicaOperation()
@ 0x7f91d5a576f1 yb::consensus::RaftConsensus::StartReplicaOperationUnlocked()
@ 0x7f91d5a6098f yb::consensus::RaftConsensus::Start()
@ 0x7f91d5de1a83 yb::tablet::TabletPeer::Start()
@ 0x7f91d66e1ba8 yb::tserver::TSTabletManager::OpenTablet()
@ 0x7f91cd84c4a4 yb::ThreadPool::DispatchThread()
@ 0x7f91cd848e2f yb::Thread::SuperviseThread()
@ 0x7f91c807f694 start_thread
@ 0x7f91c77bc41d __clone
@ (nil) (unknown)

Aha! Link: https://yugabyte-test.aha.io/features/PLATFORM-181

@mbautin mbautin added the area/docdb YugabyteDB core features label Jul 8, 2020
@mbautin mbautin self-assigned this Jul 8, 2020

mbautin commented Jul 8, 2020

Timestamp of the last replicated record:

>>> datetime.utcfromtimestamp(1594065993.286094).strftime('%Y-%m-%d %H:%M:%S.%f')
'2020-07-06 20:06:33.286094'

New operation's hybrid time:

>>> datetime.utcfromtimestamp(1593561737.762828).strftime('%Y-%m-%d %H:%M:%S.%f')
'2020-07-01 00:02:17.762828'
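
For reference, here is a small Python helper along the same lines (the helper name is just for illustration; it only assumes the hybrid time layout described later in this thread, i.e. physical microseconds shifted left by 12 bits plus a logical counter):

from datetime import datetime, timedelta

LOGICAL_BITS = 12  # raw hybrid time = (physical microseconds << 12) | logical counter

def decode_hybrid_time(raw):
    # Split a raw hybrid time into physical microseconds, the logical counter,
    # and a human-readable UTC timestamp.
    physical_us = raw >> LOGICAL_BITS
    logical = raw & ((1 << LOGICAL_BITS) - 1)
    wall_clock = datetime(1970, 1, 1) + timedelta(microseconds=physical_us)
    return physical_us, logical, wall_clock

>>> decode_hybrid_time(6527228877876543488)
(1593561737762828, 0, datetime.datetime(2020, 7, 1, 0, 2, 17, 762828))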

In RaftConsensus::Start we do this:

  {
    ReplicaState::UniqueLock lock;
    RETURN_NOT_OK(state_->LockForStart(&lock));
    state_->ClearLeaderUnlocked();

    RETURN_NOT_OK_PREPEND(state_->StartUnlocked(info.last_id),
                          "Unable to start RAFT ReplicaState");

    LOG_WITH_PREFIX(INFO) << "Replica starting. Triggering "
                          << info.orphaned_replicates.size()
                          << " pending operations. Active config: "
                          << state_->GetActiveConfigUnlocked().ShortDebugString();
    for (const auto& replicate : info.orphaned_replicates) {
      ReplicateMsgPtr replicate_ptr = std::make_shared<ReplicateMsg>(*replicate);
      RETURN_NOT_OK(StartReplicaOperationUnlocked(replicate_ptr, HybridTime::kInvalid));
    }

    RETURN_NOT_OK(state_->InitCommittedOpIdUnlocked(yb::OpId::FromPB(info.last_committed_id)));

    queue_->Init(state_->GetLastReceivedOpIdUnlocked().ToPB<OpId>());
  }

(see the call to StartReplicaOperationUnlocked).


mbautin commented Jul 8, 2020

YugabyteDB version:
version 2.1.8.2 build 1 revision 7724903 build_type RELEASE built at 18 Jun 2020 23:42:18 UTC


mbautin commented Jul 8, 2020

Looking at the WAL dumps. wal-000000007 and wal-000000008 are the files containing some data before the crash, and wal-000000009 and beyond were generated by repeated restarts.

Trying to locate the new operation in the WAL.
( log-dump --print_entries=pb wal-000000007 | grep hybrid_time | awk '{print $NF}'; echo "$(( 1593561737762828 * 4096 )) <-- NEW OPERATION" ) | sort -n

This yields the following:

...
6527228877805551616
6527228877805572096
6527228877805588480
6527228877809836032
6527228877809905664
6527228877809926144
6527228877809942528
6527228877809979392
6527228877810008064
6527228877810024448
6527228877876482048
6527228877876543488
6527228877876543488 <-- NEW OPERATION
6527228877876568064
6527228877876580352
6527228877876592640
6527228877879648256
6527228877879689216
6527228877883375616
6527228877883408384
6527228877883432960
6527228877883445248
6527228877883457536
6527228877883478016
...


mbautin commented Jul 8, 2020

A slightly different way to look at the log (skipping the application data):

log-dump --print_entries=pb wal-000000007 | grep -v '^      '

The problematic entry, with the entries immediately before and after it:

Entry:
type: REPLICATE
replicate {
  id {
    term: 8
    index: 2399
  }
  hybrid_time: 6527228877876482048
  op_type: WRITE_OP
  write_request {
    tablet_id: "91bf626f5794471ba61db14f07aaca8f"
    write_batch {
    }
    client_id1: 12700506567646334671
    client_id2: 17721724135354447017
    request_id: 830
    min_running_request_id: 543
    batch_idx: 18446744073709551615
  }
  committed_op_id {
    term: 8
    index: 2107
  }
  monotonic_counter: 0
}
Entry:
type: REPLICATE
replicate {
  id {
    term: 8
    index: 2400
  }
  hybrid_time: 6527228877876543488 <-- PROBLEMATIC ENTRY
  op_type: WRITE_OP
  write_request {
    tablet_id: "91bf626f5794471ba61db14f07aaca8f"
    write_batch {
    }
    client_id1: 16303715820282376196
    client_id2: 14161383940175931839
    request_id: 672
    min_running_request_id: 579
    batch_idx: 18446744073709551615
  }
  committed_op_id {
    term: 8
    index: 2107
  }
  monotonic_counter: 0
}
Entry:
type: REPLICATE
replicate {
  id {
    term: 8
    index: 2401
  }
  hybrid_time: 6527228877876568064
  op_type: WRITE_OP
  write_request {
    tablet_id: "91bf626f5794471ba61db14f07aaca8f"
    write_batch {
    }
    client_id1: 16303715820282376196
    client_id2: 14161383940175931839
    request_id: 796
    min_running_request_id: 579
    batch_idx: 18446744073709551615
  }
  committed_op_id {
    term: 8
    index: 2107
  }
  monotonic_counter: 0
}


mbautin commented Jul 8, 2020

A way to look at only op ids and committed op ids:

log-dump --print_entries=pb wal-000000007 | grep -v '^      ' | grep -EA 3 'id {|committed_op_id {'

It is clear that the committed op id gets stuck at 8.2107 for a long time:

--
  id {
    term: 8
    index: 2400
  }
--
  committed_op_id {
    term: 8
    index: 2107
  }
--
  id {
    term: 8
    index: 2401
  }
--
  committed_op_id {
    term: 8
    index: 2107
  }
--
  id {
    term: 8
    index: 2402
  }
--
  committed_op_id {
    term: 8
    index: 2107
  }
--
. . .
--
  id {
    term: 8
    index: 2512
  }
--
  committed_op_id {
    term: 8
    index: 2107
  }
--
  id {
    term: 8
    index: 2513
  }
--
  committed_op_id {
    term: 8
    index: 2107
  }
--
  id {
    term: 9
    index: 2108
  }
--
  committed_op_id {
    term: 8
    index: 2107
  }
--
  id {
    term: 9
    index: 2109
  }
--
  committed_op_id {
    term: 9
    index: 2108
  }
--

Eventually, term 9's leader replicates 9.2108, apparently overwriting entries 8.2108 to 8.2513.


mbautin commented Jul 8, 2020

When looking at the next log file (wal-000000009), it turns out that the last entry in the log has an index of 2399 (in term 17).

bin/log-dump --print_entries=pb wal-000000009 | grep -v '^      '

These are the last two entries:

Entry:
type: REPLICATE
replicate {
  id {
    term: 17
    index: 2398
  }
  hybrid_time: 6529294250136178688
  op_type: SNAPSHOT_OP
  committed_op_id {
    term: 17
    index: 2397
  }
  monotonic_counter: 0
  snapshot_request {
    dest_uuid: "99fa659e5c5a4437b80439e092981b5b"
    operation: CREATE_ON_TABLET
    snapshot_id: "R1-\275\276\342D\013\217\322\003\226z\014\217\373"
    tablet_id: "91bf626f5794471ba61db14f07aaca8f"
    propagated_hybrid_time: 6529294249940017152
    snapshot_hybrid_time: 6529294250133237760
  }
}
Entry:
type: REPLICATE
replicate {
  id {
    term: 17
    index: 2399
  }
  hybrid_time: 6529294308499841024
  op_type: SNAPSHOT_OP
  committed_op_id {
    term: 17
    index: 2398
  }
  monotonic_counter: 0
  snapshot_request {
    dest_uuid: "99fa659e5c5a4437b80439e092981b5b"
    operation: DELETE_ON_TABLET
    snapshot_id: "R1-\275\276\342D\013\217\322\003\226z\014\217\373"
    tablet_id: "91bf626f5794471ba61db14f07aaca8f"
    propagated_hybrid_time: 6529294308497375232
  }
}

We have this code in tablet bootstrap, which is supposed to erase from pending_replicates any entries that have been overwritten by entries from later terms:

  auto iter = replay_state_->pending_replicates.lower_bound(op_id.index());

  // If there was an entry with the same index we're overwriting then we need to delete that entry
  // and all entries with higher indexes.
  if (iter != replay_state_->pending_replicates.end() && iter->first == op_id.index()) {
    auto& existing_entry = iter->second;
    auto& last_entry = replay_state_->pending_replicates.rbegin()->second;

    LOG_WITH_PREFIX(INFO) << "Overwriting operations starting at: "
                          << existing_entry.entry->replicate().id()
                          << " up to: " << last_entry.entry->replicate().id()
                          << " with operation: " << replicate.id();
    stats_.ops_overwritten += std::distance(iter, replay_state_->pending_replicates.end());
    replay_state_->pending_replicates.erase(iter, replay_state_->pending_replicates.end());
  }

When processing entry 9.2108 (the NO OP entry replicated by term 9's leader)

Entry:
type: REPLICATE
replicate {
  id {
    term: 9
    index: 2108
  }
  hybrid_time: 6527228884769497088
  op_type: NO_OP
  committed_op_id {
    term: 8
    index: 2107
  }
  noop_request {
  }
}

it is supposed to have erased the older entries with indexes 2108 and higher (8.2108 through 8.2513 inclusive), also erasing operation 8.2400. That is clearly not happening, because that operation causes the crash later.


mbautin commented Jul 8, 2020

Attaching full output of log-dump --print_entries=pb ... | grep -v '^ ' with schema removed. The links below only contain the metadata potentially relevant to investigating this issue.


mbautin commented Jul 8, 2020

And I will also mention, just in case, that the last replicated hybrid time is 1594065993286094 * 4096 = 6529294308499841024 (we use 12 bits for the logical time component), which corresponds to the last entry from wal-000000008:

Entry:
type: REPLICATE
replicate {
  id {
    term: 17
    index: 2399
  }
  hybrid_time: 6529294308499841024
  op_type: SNAPSHOT_OP
  committed_op_id {
    term: 17
    index: 2398
  }
  monotonic_counter: 0
  snapshot_request {
    dest_uuid: "99fa659e5c5a4437b80439e092981b5b"
    operation: DELETE_ON_TABLET
    snapshot_id: "R1-\275\276\342D\013\217\322\003\226z\014\217\373"
    tablet_id: "91bf626f5794471ba61db14f07aaca8f"
    propagated_hybrid_time: 6529294308497375232
  }
}


mbautin commented Jul 9, 2020

Analyzing the data from the tablet server log. Flushed OpIds are 17.2399 for both regular and intents RocksDBs.

I0708 14:37:55.957229 8517 version_set.cc:2818] T 91bf626f5794471ba61db14f07aaca8f P 621efb7a1e9544abbccb945b6b13d9be [R]: Recovered from manifest file:/mnt/disk1/yb-data/tserver/data/rocksdb/table-e7bd70eb6f3f4051b90c9effb2f7352c/tablet-91bf626f5794471ba61db14f07aaca8f/MANIFEST-001767 succeeded,manifest_file_number is 1767, next_file_number is 1769, last_sequence is 1125899910856910, log_number is 0,prev_log_number is 0,max_column_family is 0, flushed_values is 0x000000002e9bf350 -> { op_id: 17.2399 hybrid_time: { physical: 1594065993286094 } history_cutoff: { physical: 1594065867642616 } hybrid_time_filter: }
I0708 14:37:55.962592 8517 version_set.cc:2818] T 91bf626f5794471ba61db14f07aaca8f P 621efb7a1e9544abbccb945b6b13d9be [I]: Recovered from manifest file:/mnt/disk1/yb-data/tserver/data/rocksdb/table-e7bd70eb6f3f4051b90c9effb2f7352c/tablet-91bf626f5794471ba61db14f07aaca8f.intents/MANIFEST-001761 succeeded,manifest_file_number is 1761, next_file_number is 1763, last_sequence is 1125899906842624, log_number is 0,prev_log_number is 0,max_column_family is 0, flushed_values is 0x000000002e9bf530 -> { op_id: 17.2399 hybrid_time: { physical: 1594065993286094 } history_cutoff: hybrid_time_filter: }


mbautin commented Jul 9, 2020

So the reason we do not get to the code that removes records 8.2108 up to and including 8.2513 from pending_replicates when 9.2109 is encountered, i.e.

if (iter != replay_state_->pending_replicates.end() && iter->first == op_id.index()) {

is this earlier part of the same function (TabletBootstrap::HandleReplicateMessage):

  auto min_index = std::min(
      replay_state_->regular_stored_op_id.index(), replay_state_->intents_stored_op_id.index());

  if (op_id.index() <= min_index) {
    // Do not update the bootstrap in-memory state for log records that have already been applied to
    // RocksDB, or were overwritten by a later entry with a higher term that has already been
    // applied to both regular and provisional record RocksDB.
    replicate_entry_ptr->reset();
    return Status::OK();
  }

min_index is 2399 in this case, so we simply keep those records in pending_replicates at that point.
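
Putting the two snippets together, here is a minimal Python model of the replay (illustrative only, not the actual C++ data structures) that reproduces how 8.2400 through 8.2513 survive in pending_replicates:

MIN_INDEX = 2399  # min of the regular and intents flushed OpId indexes (17.2399 here)

pending_replicates = {}  # Raft index -> (term, index) of the pending entry

def handle_replicate_message(term, index):
    # Early return for entries at or below the flushed index -- this is what
    # prevents the overwrite handling below from ever running for 9.2108.
    if index <= MIN_INDEX:
        return
    # Overwrite handling: only triggers when the incoming index has an exact
    # match among the pending entries.
    if index in pending_replicates:
        for i in [i for i in pending_replicates if i >= index]:
            del pending_replicates[i]
    pending_replicates[index] = (term, index)

for i in range(2108, 2514):
    handle_replicate_message(8, i)   # 8.2108 .. 8.2513 from the old term-8 tail
for i in range(2108, 2400):
    handle_replicate_message(9, i)   # term 9 and later terms (simplified to one loop) only reach index 2399

>>> min(pending_replicates)
2400

So operation 8.2400 is still pending at the end of bootstrap and is later replayed as an orphaned replicate with its old hybrid time, triggering the "hybrid time too low" crash above.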

mbautin added a commit that referenced this issue Jul 16, 2020
Summary:
Enhance log-dump to provide a way to filter a WAL segment and remove a range of OpIds. This is
needed for recovering clusters affected by #5003.

Also adding a new flag --initial_log_segment_size_bytes for configuring the size of the first
log segment (1MB by default). Previously, the default value of the initial log segment size
was also 1MB and it was achieved by hard-coding 512KB as the default value of the current
segment size and doubling it on every segment allocation, up to the maximum size. Now, we
set current segment size to half of FLAGS_initial_log_segment_size_bytes and then double it
at the very first log segment allocation to achieve the same effect, but with cleaner
semantics for the user of the Log class.
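
A quick sketch of the size progression being described (the 64 MB cap below is just a placeholder for the maximum segment size, not the actual flag default):

def segment_sizes(initial_log_segment_size_bytes, max_segment_size_bytes, n):
    # New scheme: start at half of the configured initial size, then double on
    # every allocation, so the very first segment comes out at the initial size.
    size = initial_log_segment_size_bytes // 2
    result = []
    for _ in range(n):
        size = min(size * 2, max_segment_size_bytes)
        result.append(size)
    return result

>>> segment_sizes(1 << 20, 64 << 20, 8)   # 1MB initial segment, then doubling
[1048576, 2097152, 4194304, 8388608, 16777216, 33554432, 67108864, 67108864]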

Test Plan:
Filter the log from #5003.

  log-dump --filter_log_segment \
      ~/log-dump-test/input/wal-000000007 \
      --output_wal_dir ~/log-dump-test/output \
      --min_op_term_to_omit 8 --min_op_index_to_omit 2108 \
      --max_op_term_to_omit 8 --max_op_index_to_omit 2513

It should produce output similar to:

    I0710 18:24:57.233637 13345 log-dump.cc:249] Created directory ~/log-dump-test/output
    I0710 18:24:57.234838 13345 log-dump.cc:269] Source segment size ~/log-dump-test/input/wal-000000007: 67389360 bytes
    I0710 18:24:57.234859 13345 log-dump.cc:271] Target segment size: 74448896 bytes
    I0710 18:24:57.234988 13345 log-dump.cc:281] Will omit records between OpIds { term: 8 index: 2108 } and { term: 8 index: 2513 } (including the exact OpId matches).
    I0710 18:24:57.235208 13345 log.cc:819] T 91bf626f5794471ba61db14f07aaca8f P log-dump-tool: Setting log wal retention time to 0 seconds
    I0710 18:24:57.235327 13345 log.cc:429] interval_durable_wal_write_ms is turned on to sync every 1000 ms.
    I0710 18:24:58.092380 13345 log-dump.cc:319] Included 881 entries, omitted 406 entries
    I0710 18:24:58.330152 13345 log-dump.cc:326] Generated file ~/log-dump-test/output/index.000000000
    I0710 18:24:58.330181 13345 log-dump.cc:326] Generated file ~/log-dump-test/output/wal-000000007

Look at the size of the new file. It should be 45974166 or something close.

Then compare the output of log-dump on old and new files:

  diff <(./log-dump ~/log-dump-test/input/wal-000000007 | grep -v 'Tablet: 91bf626f5794471ba61db14f07aaca8f') \
       <(./log-dump ~/log-dump-test/output/wal-000000007 | grep -v 'Tablet: 91bf626f5794471ba61db14f07aaca8f')

The result should be similar to https://gist.githubusercontent.com/mbautin/5eccd75c5a4935ba6a760978fa76cf0b/raw

Reviewers: hector, kannan, rahuldesirazu

Reviewed By: rahuldesirazu

Subscribers: ybase, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D8876
mbautin added a commit that referenced this issue Jul 21, 2020
… incorrect entries on tablet bootstrap

Summary:
Fixing various bugs in tablet bootstrap.

#5003: WAL tail is not being overwritten correctly by a new leader
==================================================================

We can get into a situation when part of the WAL with indexes higher
than the last index flushed into RocksDB stays in pending_replicates by
the end of tablet bootstrap. Appending a future entry will fail in this
case, as happens in #5003. This bug happens because we don't process
WAL tail overwrites caused by entries whose index is less than or equal
to the last flushed index. The solution is to always process these
overwrites, even if the newly added entry's index has no exact match in
pending_replicates, and remove entries with higher indexes.
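
As a rough sketch of that rule (not the actual patch), the overwrite handling can be thought of as truncating pending_replicates from the first index at or above the incoming entry's index, before the flushed-index check rather than after it:

def handle_replicate_message_fixed(pending_replicates, flushed_index, term, index):
    # Process a possible WAL tail overwrite unconditionally: drop every pending
    # entry whose index is >= the incoming entry's index, exact match or not.
    for i in [i for i in pending_replicates if i >= index]:
        del pending_replicates[i]
    # Only then skip entries already covered by the flushed OpId.
    if index <= flushed_index:
        return
    pending_replicates[index] = (term, index)

With this rule, the 9.2108 NO_OP in the WAL above would clear 8.2400 .. 8.2513 out of pending_replicates even though its own index is below the flushed index.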

Treating entries that were never committed as committed
=======================================================

Part of tablet bootstrap logic that is supposed to only run on
committed entries could sometimes run on entries that were never
committed. This was possible because we still had an old way to recover
the committed OpId by using a Raft index-only match of the flushed OpId to
the entry being processed. This is a leftover from the time when we
were setting RocksDB sequence numbers to Raft indexes. E.g. suppose
entry 2.10 is flushed into RocksDB. When we encounter entry 1.10 during
bootstrap, we would consider it committed, even though it might have
been overwritten later by entry 2.5, because its index 10 matches that
of the flushed OpId. We probably would not apply such potentially
overwritten entries to RocksDB anyway as its index is too low and we
have more checks later in the process. However, this outdated logic is
not needed, because we have the "user frontier" / "consensus frontier"
mechanism to keep track of the last OpIds flushed into RocksDBs.

Not registering flushed operations with RetryableRequests
=========================================================

RetryableRequests is a subsystem introduced in
023c20a for ignoring duplicate write
requests to a tablet. It is important for consistency, and ensures that
retries at a higher layer do not cause multiple writes. In tablet
bootstrap, we need to register committed operations that happened
during the last FLAGS_retryable_request_timeout_secs seconds (120 seconds
by default) with RetryableRequests, whether or not those operations are
flushed. (Uncommitted operations are supposed to be passed to the
consensus layer as "orphaned replicates" and be registered with
RetryableRequests there.) Prior to this diff, operations with index
lower than the flushed index were not registered with RetryableRequests
for transactional tables.

#3759: potentially skipping too many segments with skip_wal_rewrite
===================================================================

We were starting replay from the last segment where the committed OpId
of the first entry was less than or equal to the flushed OpId.Suppose
segment 1 has entries 1.1 to 1.10 and nothing is committed, and segment
2 has entries 1.11 and 1.20, and only entry 1.2 is committed. Flushed
OpId is 1.5. Then we might look at segment 2, find that 1.2 (committed
OpId) <= 1.5 (flushed OpId), and incorrectly skip segment 1, even
though it contains entries we need. Instead of using committed OpId, we
need to be selecting the first segment to replay as the last segment
where the actual OpId of the first entry is less than or equal to the
flushed OpId.
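
The example above, spelled out as a tiny sketch (OpIds as (term, index) tuples; purely illustrative):

segments = [
    {"name": "segment-1", "first_op_id": (1, 1),  "first_committed_op_id": (0, 0)},
    {"name": "segment-2", "first_op_id": (1, 11), "first_committed_op_id": (1, 2)},
]
flushed_op_id = (1, 5)

def first_segment_to_replay(segments, flushed):
    # Corrected rule: the last segment whose first entry's actual OpId is
    # <= the flushed OpId.
    candidate = segments[0]
    for segment in segments:
        if segment["first_op_id"] <= flushed:
            candidate = segment
    return candidate

>>> first_segment_to_replay(segments, flushed_op_id)["name"]
'segment-1'

The old rule compared first_committed_op_id instead ((1, 2) <= (1, 5)), which would have started replay from segment-2 and skipped entries 1.6 through 1.10 that still need to be replayed.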

#4983: only create a new log segment after successful bootstrap
===============================================================

In the skip_wal_rewrite mode we are always creating a new log segment
before starting log replay. Then log replay picks up the new segment,
finds a "corruption" in it, and has to skip it. If a tablet server gets
into a restart loop due to tablet bootstrap failure, hundreds of new
empty log segments may get created quickly. It is much cleaner to only
create a new log segment after successful tablet bootstrap in the
skip_wal_rewrite mode (which is now the default mode). Also as part of
doing this, I simplified the ReadFirstEntryMetadata function by
delegating to the function ReadEntries with an additional parameter to
just read one entry.

Refactoring
===========

Reorganizing the tablet bootstrap code to fully hide the
TabletBootstrap class in the implementation file.

Other improvements
==================

Adding new test macros ASSERT_SETS_EQ and ASSERT_VECTORS_EQ.
ASSERT_VECTORS_EQ reports differences between element sets of two
vectors in case they do not match, in addition to regular output,
making it easier to spot differences in many cases.

Documenting intended usage for YB_STRUCT_TO_STRING and
YB_CLASS_TO_STRING macros.

Test Plan:
Jenkins

A new randomized tablet bootstrap test creating a log with multiple terms and overwriting of the WAL's tail.

Reviewers: amitanand, sergei, rahuldesirazu

Reviewed By: sergei

Subscribers: sergei, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D8917
mbautin added a commit that referenced this issue Jul 23, 2020
…leader and replay of incorrect entries on tablet bootstrap

Summary:
This is a backport of e4e2d39 (D8917).

Fixing various bugs in tablet bootstrap.

#5003: WAL tail is not being overwritten correctly by a new leader
==================================================================

We can get into a situation when part of the WAL with indexes higher
than the last index flushed into RocksDB stays in pending_replicates by
the end of tablet bootstrap. Appending a future entry will fail in this
case, as happens in #5003. This bug happens because we don't process
WAL tail overwrites caused by entries whose index is less than or equal
to the last flushed index. The solution is to always process these
overwrites, even if the newly added entry's index has no exact match in
pending_replicates, and remove entries with higher indexes.

Treating entries that were never committed as committed
=======================================================

Part of tablet bootstrap logic that is supposed to only run on
committed entries could sometimes run on entries that were never
committed. This was possible because we still had an old way to recover
the committed OpId by using a Raft index-only match of the flushed OpId to
the entry being processed. This is a leftover from the time when we
were setting RocksDB sequence numbers to Raft indexes. E.g. suppose
entry 2.10 is flushed into RocksDB. When we encounter entry 1.10 during
bootstrap, we would consider it committed, even though it might have
been overwritten later by entry 2.5, because its index 10 matches that
of the flushed OpId. We probably would not apply such potentially
overwritten entries to RocksDB anyway as its index is too low and we
have more checks later in the process. However, this outdated logic is
not needed, because we have the "user frontier" / "consensus frontier"
mechanism to keep track of the last OpIds flushed into RocksDBs.

Not registering flushed operations with RetryableRequests
=========================================================

RetryableRequests is a subsystem introduced in
023c20a for ignoring duplicate write
requests to a tablet. It is important for consistency, and ensures that
retries at a higher layer do not cause multiple writes. In tablet
bootstrap, we need to register committed operations that happened
during the last FLAGS_retryable_request_timeout_secs seconds (120 seconds
by default) with RetryableRequests, whether or not those operations are
flushed. (Uncommitted operations are supposed to be passed to the
consensus layer as "orphaned replicates" and be registered with
RetryableRequests there.) Prior to this diff, operations with index
lower than the flushed index were not registered with RetryableRequests
for transactional tables.

[#3759] Potentially skipping too many segments with skip_wal_rewrite
===================================================================

We were starting replay from the last segment where the committed OpId
of the first entry was less than or equal to the flushed OpId. Suppose
segment 1 has entries 1.1 to 1.10 and nothing is committed, and segment
2 has entries 1.11 to 1.20, and only entry 1.2 is committed. Flushed
OpId is 1.5. Then we might look at segment 2, find that 1.2 (committed
OpId) <= 1.5 (flushed OpId), and incorrectly skip segment 1, even
though it contains entries we need. Instead of using committed OpId, we
need to be selecting the first segment to replay as the last segment
where the actual OpId of the first entry is less than or equal to the
flushed OpId.

[#4983] Only create a new log segment after successful bootstrap
===============================================================

In the skip_wal_rewrite mode we are always creating a new log segment
before starting log replay. Then log replay picks up the new segment,
finds a "corruption" in it, and has to skip it. If a tablet server gets
into a restart loop due to tablet bootstrap failure, hundreds of new
empty log segments may get created quickly. It is much cleaner to only
create a new log segment after successful tablet bootstrap in the
skip_wal_rewrite mode (which is now the default mode). Also as part of
doing this, I simplified the ReadFirstEntryMetadata function by
delegating to the function ReadEntries with an additional parameter to
just read one entry.

Refactoring
===========

Reorganizing the tablet bootstrap code to fully hide the
TabletBootstrap class in the implementation file.

Other improvements
==================

Adding new test macros ASSERT_SETS_EQ and ASSERT_VECTORS_EQ.
ASSERT_VECTORS_EQ reports differences between element sets of two
vectors in case they do not match, in addition to regular output,
making it easier to spot differences in many cases.

Documenting intended usage for YB_STRUCT_TO_STRING and
YB_CLASS_TO_STRING macros.

Test Plan:
Jenkins: urgent, patch: 2.1.8, auto rebase: no

A new randomized tablet bootstrap test creating a log with multiple
terms and overwriting of the WAL's tail.

Reviewers: amitanand, rahuldesirazu, sergei

Reviewed By: sergei

Subscribers: bogdan, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D8990
mbautin added a commit that referenced this issue Jul 23, 2020
…leader and replay of incorrect entries on tablet bootstrap

Summary:
This is a backport of e4e2d39 (D8917).

Fixing various bugs in tablet bootstrap.

#5003: WAL tail is not being overwritten correctly by a new leader
==================================================================

We can get into a situation when part of the WAL with indexes higher
than the last index flushed into RocksDB stays in pending_replicates by
the end of tablet bootstrap. Appending a future entry will fail in this
case, as happens in #5003. This bug happens because we don't process
WAL tail overwrites caused by entries whose index is less than or equal
to the last flushed index. The solution is to always process these
overwrites, even if the newly added entry's index has no exact match in
pending_replicates, and remove entries with higher indexes.

Treating entries that were never committed as committed
=======================================================

Part of tablet bootstrap logic that is supposed to only run on
committed entries could sometimes run on entries that were never
committed. This was possible because we still had an old way to recover
the committed OpId by using a Raft index-only match of the flushed OpId to
the entry being processed. This is a leftover from the time when we
were setting RocksDB sequence numbers to Raft indexes. E.g. suppose
entry 2.10 is flushed into RocksDB. When we encounter entry 1.10 during
bootstrap, we would consider it committed, even though it might have
been overwritten later by entry 2.5, because its index 10 matches that
of the flushed OpId. We probably would not apply such potentially
overwritten entries to RocksDB anyway as its index is too low and we
have more checks later in the process. However, this outdated logic is
not needed, because we have the "user frontier" / "consensus frontier"
mechanism to keep track of the last OpIds flushed into RocksDBs.

Not registering flushed operations with RetryableRequests
=========================================================

RetryableRequests is a subsystem introduced in
023c20a for ignoring duplicate write
requests to a tablet. It is important for consistency, and ensures that
retries at a higher layer do not cause multiple writes. In tablet
bootstrap, we need to register committed operations that happened
during the last FLAGS_retryable_request_timeout_secs seconds (120 seconds
by default) with RetryableRequests, whether or not those operations are
flushed. (Uncommitted operations are supposed to be passed to the
consensus layer as "orphaned replicates" and be registered with
RetryableRequests there.) Prior to this diff, operations with index
lower than the flushed index were not registered with RetryableRequests
for transactional tables.

[#3759] Potentially skipping too many segments with skip_wal_rewrite
===================================================================

We were starting replay from the last segment where the committed OpId
of the first entry was less than or equal to the flushed OpId. Suppose
segment 1 has entries 1.1 to 1.10 and nothing is committed, and segment
2 has entries 1.11 to 1.20, and only entry 1.2 is committed. Flushed
OpId is 1.5. Then we might look at segment 2, find that 1.2 (committed
OpId) <= 1.5 (flushed OpId), and incorrectly skip segment 1, even
though it contains entries we need. Instead of using committed OpId, we
need to be selecting the first segment to replay as the last segment
where the actual OpId of the first entry is less than or equal to the
flushed OpId.

[#4983] Only create a new log segment after successful bootstrap
===============================================================

In the skip_wal_rewrite mode we are always creating a new log segment
before starting log replay. Then log replay picks up the new segment,
finds a "corruption" in it, and has to skip it. If a tablet server gets
into a restart loop due to tablet bootstrap failure, hundreds of new
empty log segments may get created quickly. It is much cleaner to only
create a new log segment after successful tablet bootstrap in the
skip_wal_rewrite mode (which is now the default mode). Also as part of
doing this, I simplified the ReadFirstEntryMetadata function by
delegating to the function ReadEntries with an additional parameter to
just read one entry.

Refactoring
===========

Reorganizing the tablet bootstrap code to fully hide the
TabletBootstrap class in the implementation file.

Other improvements
==================

Adding new test macros ASSERT_SETS_EQ and ASSERT_VECTORS_EQ.
ASSERT_VECTORS_EQ reports differences between element sets of two
vectors in case they do not match, in addition to regular output,
making it easier to spot differences in many cases.

Documenting intended usage for YB_STRUCT_TO_STRING and
YB_CLASS_TO_STRING macros.

Test Plan:
Jenkins: patch: 2.2.0

A new randomized tablet bootstrap test creating a log with multiple
terms and overwriting of the WAL's tail.

Reviewers: amitanand, rahuldesirazu, sergei

Reviewed By: sergei

Subscribers: ybase, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D8988
@mbautin mbautin closed this as completed Aug 11, 2020