Hybrid time too low in RaftConsensus::Start #5003

Closed
mbautin opened this issue Jul 8, 2020 · 10 comments

mbautin commented Jul 8, 2020

F20200707 23:25:17 ../../src/yb/tablet/mvcc.cc:398] T 91bf626f5794471ba61db14f07aaca8f P 621efb7a1e9544abbccb945b6b13d9be: T 91bf626f5794471ba61db14f07aaca8f P 621efb7a1e9544abbccb945b6b13d9be: Recent 2 MVCC operations:

  1. SetLastReplicated { ht: { physical: 1594065993286094 } }
  2. SetLeaderOnlyMode { leader_only: 0 }

New operation's hybrid time too low: { physical: 1593561737762828 }
max_safe_time_returned_with_lease_={ safe_time: source: kUnknown }
*ht < max_safe_time_returned_with_lease_.safe_time=0
static_cast<int64_t>(ht->ToUint64() - max_safe_time_returned_with_lease_.safe_time.ToUint64())=6527228877876543488
ht->PhysicalDiff(max_safe_time_returned_with_lease_.safe_time)=1593561737762828

max_safe_time_returned_without_lease_={ safe_time: source: kUnknown }
*ht < max_safe_time_returned_without_lease_.safe_time=0
static_cast<int64_t>(ht->ToUint64() - max_safe_time_returned_without_lease_.safe_time.ToUint64())=6527228877876543488
ht->PhysicalDiff(max_safe_time_returned_without_lease_.safe_time)=1593561737762828

max_safe_time_returned_for_follower_={ safe_time: source: kUnknown }
*ht < max_safe_time_returned_for_follower_.safe_time=0
static_cast<int64_t>(ht->ToUint64() - max_safe_time_returned_for_follower_.safe_time.ToUint64())=6527228877876543488
ht->PhysicalDiff(max_safe_time_returned_for_follower_.safe_time)=1593561737762828

(SafeTimeWithSource{last_replicated_, SafeTimeSource::kUnknown})={ safe_time: { physical: 1594065993286094 } source: kUnknown }
*ht < (SafeTimeWithSource{last_replicated_, SafeTimeSource::kUnknown}).safe_time=1
static_cast<int64_t>(ht->ToUint64() - (SafeTimeWithSource{last_replicated_, SafeTimeSource::kUnknown}).safe_time.ToUint64())=-2065430623297536
ht->PhysicalDiff((SafeTimeWithSource{last_replicated_, SafeTimeSource::kUnknown}).safe_time)=-504255523266

(SafeTimeWithSource{last_ht_in_queue, SafeTimeSource::kUnknown})={ safe_time: source: kUnknown }
*ht < (SafeTimeWithSource{last_ht_in_queue, SafeTimeSource::kUnknown}).safe_time=0
static_cast<int64_t>(ht->ToUint64() - (SafeTimeWithSource{last_ht_in_queue, SafeTimeSource::kUnknown}).safe_time.ToUint64())=6527228877876543488
ht->PhysicalDiff((SafeTimeWithSource{last_ht_in_queue, SafeTimeSource::kUnknown}).safe_time)=1593561737762828

is_follower_side=1
queue_.size()=0
queue_=[]
aborted=[]
@ 0x7f91cd7b9c0c yb::LogFatalHandlerSink::send()
@ 0x7f91cc9a1346 google::LogMessage::SendToLog()
@ 0x7f91cc99e7aa google::LogMessage::Flush()
@ 0x7f91cc9a1879 google::LogMessageFatal::~LogMessageFatal()
@ 0x7f91d5e348fe yb::tablet::MvccManager::AddPending()
@ 0x7f91d5ddfdfb yb::tablet::TabletPeer::StartReplicaOperation()
@ 0x7f91d5a576f1 yb::consensus::RaftConsensus::StartReplicaOperationUnlocked()
@ 0x7f91d5a6098f yb::consensus::RaftConsensus::Start()
@ 0x7f91d5de1a83 yb::tablet::TabletPeer::Start()
@ 0x7f91d66e1ba8 yb::tserver::TSTabletManager::OpenTablet()
@ 0x7f91cd84c4a4 yb::ThreadPool::DispatchThread()
@ 0x7f91cd848e2f yb::Thread::SuperviseThread()
@ 0x7f91c807f694 start_thread
@ 0x7f91c77bc41d __clone
@ (nil) (unknown)

Aha! Link: https://yugabyte-test.aha.io/features/PLATFORM-181

@mbautin mbautin added the area/docdb YugabyteDB core features label Jul 8, 2020
@mbautin mbautin self-assigned this Jul 8, 2020

mbautin commented Jul 8, 2020

Timestamp of the last replicated record:

>>> datetime.utcfromtimestamp(1594065993.286094).strftime('%Y-%m-%d %H:%M:%S.%f')
'2020-07-06 20:06:33.286094'

New operation's hybrid time:

>>> datetime.utcfromtimestamp(1593561737.762828).strftime('%Y-%m-%d %H:%M:%S.%f')
'2020-07-01 00:02:17.762828'
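
For reference, here is a small Python helper along the same lines (the helper name is just for illustration; it only assumes the hybrid time layout described later in this thread, i.e. physical microseconds shifted left by 12 bits plus a logical counter):

from datetime import datetime, timedelta

LOGICAL_BITS = 12  # raw hybrid time = (physical microseconds << 12) | logical counter

def decode_hybrid_time(raw):
    # Split a raw hybrid time into physical microseconds, the logical counter,
    # and a human-readable UTC timestamp.
    physical_us = raw >> LOGICAL_BITS
    logical = raw & ((1 << LOGICAL_BITS) - 1)
    wall_clock = datetime(1970, 1, 1) + timedelta(microseconds=physical_us)
    return physical_us, logical, wall_clock

>>> decode_hybrid_time(6527228877876543488)
(1593561737762828, 0, datetime.datetime(2020, 7, 1, 0, 2, 17, 762828))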

In RaftConsensus::Start we do this:

  {
    ReplicaState::UniqueLock lock;
    RETURN_NOT_OK(state_->LockForStart(&lock));
    state_->ClearLeaderUnlocked();

    RETURN_NOT_OK_PREPEND(state_->StartUnlocked(info.last_id),
                          "Unable to start RAFT ReplicaState");

    LOG_WITH_PREFIX(INFO) << "Replica starting. Triggering "
                          << info.orphaned_replicates.size()
                          << " pending operations. Active config: "
                          << state_->GetActiveConfigUnlocked().ShortDebugString();
    for (const auto& replicate : info.orphaned_replicates) {
      ReplicateMsgPtr replicate_ptr = std::make_shared<ReplicateMsg>(*replicate);
      RETURN_NOT_OK(StartReplicaOperationUnlocked(replicate_ptr, HybridTime::kInvalid));
    }

    RETURN_NOT_OK(state_->InitCommittedOpIdUnlocked(yb::OpId::FromPB(info.last_committed_id)));

    queue_->Init(state_->GetLastReceivedOpIdUnlocked().ToPB<OpId>());
  }

(see the call to StartReplicaOperationUnlocked).


mbautin commented Jul 8, 2020

YugabyteDB version:
version 2.1.8.2 build 1 revision 7724903 build_type RELEASE built at 18 Jun 2020 23:42:18 UTC


mbautin commented Jul 8, 2020

Looking at the WAL dumps. wal-000000007 and wal-000000008 are the files containing some data before the crash, and wal-000000009 and beyond were generated by repeated restarts.

Trying to locate the new operation in the WAL.
( log-dump --print_entries=pb wal-000000007 | grep hybrid_time | awk '{print $NF}'; echo "$(( 1593561737762828 * 4096 )) <-- NEW OPERATION" ) | sort -n

This yields the following:

...
6527228877805551616
6527228877805572096
6527228877805588480
6527228877809836032
6527228877809905664
6527228877809926144
6527228877809942528
6527228877809979392
6527228877810008064
6527228877810024448
6527228877876482048
6527228877876543488
6527228877876543488 <-- NEW OPERATION
6527228877876568064
6527228877876580352
6527228877876592640
6527228877879648256
6527228877879689216
6527228877883375616
6527228877883408384
6527228877883432960
6527228877883445248
6527228877883457536
6527228877883478016
...


mbautin commented Jul 8, 2020

A slightly different way to look at the log (skipping the application data):

log-dump --print_entries=pb wal-000000007 | grep -v '^      '

The problematic entry, with the entries immediately before and after it:

Entry:
type: REPLICATE
replicate {
  id {
    term: 8
    index: 2399
  }
  hybrid_time: 6527228877876482048
  op_type: WRITE_OP
  write_request {
    tablet_id: "91bf626f5794471ba61db14f07aaca8f"
    write_batch {
    }
    client_id1: 12700506567646334671
    client_id2: 17721724135354447017
    request_id: 830
    min_running_request_id: 543
    batch_idx: 18446744073709551615
  }
  committed_op_id {
    term: 8
    index: 2107
  }
  monotonic_counter: 0
}
Entry:
type: REPLICATE
replicate {
  id {
    term: 8
    index: 2400
  }
  hybrid_time: 6527228877876543488 <-- PROBLEMATIC ENTRY
  op_type: WRITE_OP
  write_request {
    tablet_id: "91bf626f5794471ba61db14f07aaca8f"
    write_batch {
    }
    client_id1: 16303715820282376196
    client_id2: 14161383940175931839
    request_id: 672
    min_running_request_id: 579
    batch_idx: 18446744073709551615
  }
  committed_op_id {
    term: 8
    index: 2107
  }
  monotonic_counter: 0
}
Entry:
type: REPLICATE
replicate {
  id {
    term: 8
    index: 2401
  }
  hybrid_time: 6527228877876568064
  op_type: WRITE_OP
  write_request {
    tablet_id: "91bf626f5794471ba61db14f07aaca8f"
    write_batch {
    }
    client_id1: 16303715820282376196
    client_id2: 14161383940175931839
    request_id: 796
    min_running_request_id: 579
    batch_idx: 18446744073709551615
  }
  committed_op_id {
    term: 8
    index: 2107
  }
  monotonic_counter: 0
}


mbautin commented Jul 8, 2020

A way to look at only op ids and committed op ids:

log-dump --print_entries=pb wal-000000007 | grep -v '^      ' | grep -EA 3 'id {|committed_op_id {'

It is clear that the committed op id gets stuck at 8.2107 for a long time:

--
  id {
    term: 8
    index: 2400
  }
--
  committed_op_id {
    term: 8
    index: 2107
  }
--
  id {
    term: 8
    index: 2401
  }
--
  committed_op_id {
    term: 8
    index: 2107
  }
--
  id {
    term: 8
    index: 2402
  }
--
  committed_op_id {
    term: 8
    index: 2107
  }
--
. . .
--
  id {
    term: 8
    index: 2512
  }
--
  committed_op_id {
    term: 8
    index: 2107
  }
--
  id {
    term: 8
    index: 2513
  }
--
  committed_op_id {
    term: 8
    index: 2107
  }
--
  id {
    term: 9
    index: 2108
  }
--
  committed_op_id {
    term: 8
    index: 2107
  }
--
  id {
    term: 9
    index: 2109
  }
--
  committed_op_id {
    term: 9
    index: 2108
  }
--

Eventually, term 9's leader replicates 9.2108, apparently overwriting entries 8.2108 to 8.2513.


mbautin commented Jul 8, 2020

When looking at the next log file (wal-000000009), it turns out that the last entry in the log has an index of 2399 (in term 17).

bin/log-dump --print_entries=pb wal-000000009 | grep -v '^      '

These are the last two entries:

Entry:
type: REPLICATE
replicate {
  id {
    term: 17
    index: 2398
  }
  hybrid_time: 6529294250136178688
  op_type: SNAPSHOT_OP
  committed_op_id {
    term: 17
    index: 2397
  }
  monotonic_counter: 0
  snapshot_request {
    dest_uuid: "99fa659e5c5a4437b80439e092981b5b"
    operation: CREATE_ON_TABLET
    snapshot_id: "R1-\275\276\342D\013\217\322\003\226z\014\217\373"
    tablet_id: "91bf626f5794471ba61db14f07aaca8f"
    propagated_hybrid_time: 6529294249940017152
    snapshot_hybrid_time: 6529294250133237760
  }
}
Entry:
type: REPLICATE
replicate {
  id {
    term: 17
    index: 2399
  }
  hybrid_time: 6529294308499841024
  op_type: SNAPSHOT_OP
  committed_op_id {
    term: 17
    index: 2398
  }
  monotonic_counter: 0
  snapshot_request {
    dest_uuid: "99fa659e5c5a4437b80439e092981b5b"
    operation: DELETE_ON_TABLET
    snapshot_id: "R1-\275\276\342D\013\217\322\003\226z\014\217\373"
    tablet_id: "91bf626f5794471ba61db14f07aaca8f"
    propagated_hybrid_time: 6529294308497375232
  }
}

We have this code in tablet bootstrap, which is supposed to erase from pending_replicates any entries that have been overwritten by entries from later terms:

  auto iter = replay_state_->pending_replicates.lower_bound(op_id.index());

  // If there was an entry with the same index we're overwriting then we need to delete that entry
  // and all entries with higher indexes.
  if (iter != replay_state_->pending_replicates.end() && iter->first == op_id.index()) {
    auto& existing_entry = iter->second;
    auto& last_entry = replay_state_->pending_replicates.rbegin()->second;

    LOG_WITH_PREFIX(INFO) << "Overwriting operations starting at: "
                          << existing_entry.entry->replicate().id()
                          << " up to: " << last_entry.entry->replicate().id()
                          << " with operation: " << replicate.id();
    stats_.ops_overwritten += std::distance(iter, replay_state_->pending_replicates.end());
    replay_state_->pending_replicates.erase(iter, replay_state_->pending_replicates.end());
  }

When processing entry 9.2108 (the NO OP entry replicated by term 9's leader)

Entry:
type: REPLICATE
replicate {
  id {
    term: 9
    index: 2108
  }
  hybrid_time: 6527228884769497088
  op_type: NO_OP
  committed_op_id {
    term: 8
    index: 2107
  }
  noop_request {
  }
}

it is supposed to have erased the older entries with indexes 2108 and higher (8.2108 through 8.2513 inclusive), also erasing operation 8.2400. That is clearly not happening, because that operation causes the crash later.


mbautin commented Jul 8, 2020

Attaching full output of log-dump --print_entries=pb ... | grep -v '^ ' with schema removed. The links below only contain the metadata potentially relevant to investigating this issue.


mbautin commented Jul 8, 2020

And I will also mention, just in case, that the last replicated hybrid time is 1594065993286094 * 4096 = 6529294308499841024 (we use 12 bits for the logical time component), which corresponds to the last entry from wal-000000008:

Entry:
type: REPLICATE
replicate {
  id {
    term: 17
    index: 2399
  }
  hybrid_time: 6529294308499841024
  op_type: SNAPSHOT_OP
  committed_op_id {
    term: 17
    index: 2398
  }
  monotonic_counter: 0
  snapshot_request {
    dest_uuid: "99fa659e5c5a4437b80439e092981b5b"
    operation: DELETE_ON_TABLET
    snapshot_id: "R1-\275\276\342D\013\217\322\003\226z\014\217\373"
    tablet_id: "91bf626f5794471ba61db14f07aaca8f"
    propagated_hybrid_time: 6529294308497375232
  }
}


mbautin commented Jul 9, 2020

Analyzing the data from the tablet server log. Flushed OpIds are 17.2399 for both regular and intents RocksDBs.

I0708 14:37:55.957229 8517 version_set.cc:2818] T 91bf626f5794471ba61db14f07aaca8f P 621efb7a1e9544abbccb945b6b13d9be [R]: Recovered from manifest file:/mnt/disk1/yb-data/tserver/data/rocksdb/table-e7bd70eb6f3f4051b90c9effb2f7352c/tablet-91bf626f5794471ba61db14f07aaca8f/MANIFEST-001767 succeeded,manifest_file_number is 1767, next_file_number is 1769, last_sequence is 1125899910856910, log_number is 0,prev_log_number is 0,max_column_family is 0, flushed_values is 0x000000002e9bf350 -> { op_id: 17.2399 hybrid_time: { physical: 1594065993286094 } history_cutoff: { physical: 1594065867642616 } hybrid_time_filter: }
I0708 14:37:55.962592 8517 version_set.cc:2818] T 91bf626f5794471ba61db14f07aaca8f P 621efb7a1e9544abbccb945b6b13d9be [I]: Recovered from manifest file:/mnt/disk1/yb-data/tserver/data/rocksdb/table-e7bd70eb6f3f4051b90c9effb2f7352c/tablet-91bf626f5794471ba61db14f07aaca8f.intents/MANIFEST-001761 succeeded,manifest_file_number is 1761, next_file_number is 1763, last_sequence is 1125899906842624, log_number is 0,prev_log_number is 0,max_column_family is 0, flushed_values is 0x000000002e9bf530 -> { op_id: 17.2399 hybrid_time: { physical: 1594065993286094 } history_cutoff: hybrid_time_filter: }


mbautin commented Jul 9, 2020

So the reason we do not get to the code that removes records 8.2108 up to and including 8.2513 from pending_replicates when 9.2109 is encountered, i.e.

if (iter != replay_state_->pending_replicates.end() && iter->first == op_id.index()) {

is this earlier part of the same function (TabletBootstrap::HandleReplicateMessage):

  auto min_index = std::min(
      replay_state_->regular_stored_op_id.index(), replay_state_->intents_stored_op_id.index());

  if (op_id.index() <= min_index) {
    // Do not update the bootstrap in-memory state for log records that have already been applied to
    // RocksDB, or were overwritten by a later entry with a higher term that has already been
    // applied to both regular and provisional record RocksDB.
    replicate_entry_ptr->reset();
    return Status::OK();
  }

min_index is 2399 in this case, so we simply keep those records in pending_replicates at that point.
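
Putting the two snippets together, here is a minimal Python model of the replay (illustrative only, not the actual C++ data structures) that reproduces how 8.2400 through 8.2513 survive in pending_replicates:

MIN_INDEX = 2399  # min of the regular and intents flushed OpId indexes (17.2399 here)

pending_replicates = {}  # Raft index -> (term, index) of the pending entry

def handle_replicate_message(term, index):
    # Early return for entries at or below the flushed index -- this is what
    # prevents the overwrite handling below from ever running for 9.2108.
    if index <= MIN_INDEX:
        return
    # Overwrite handling: only triggers when the incoming index has an exact
    # match among the pending entries.
    if index in pending_replicates:
        for i in [i for i in pending_replicates if i >= index]:
            del pending_replicates[i]
    pending_replicates[index] = (term, index)

for i in range(2108, 2514):
    handle_replicate_message(8, i)   # 8.2108 .. 8.2513 from the old term-8 tail
for i in range(2108, 2400):
    handle_replicate_message(9, i)   # term 9 and later terms (simplified to one loop) only reach index 2399

>>> min(pending_replicates)
2400

So operation 8.2400 is still pending at the end of bootstrap and is later replayed as an orphaned replicate with its old hybrid time, triggering the "hybrid time too low" crash above.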

mbautin added a commit that referenced this issue Jul 16, 2020
Summary:
Enhance log-dump to provide a way to filter a WAL segment and remove a range of OpIds. This is
needed for recovering clusters affected by #5003.

Also adding a new flag --initial_log_segment_size_bytes for configuring the size of the first
log segment (1MB by default). Previously, the default value of the initial log segment size
was also 1MB and it was achieved by hard-coding 512KB as the default value of the current
segment size and doubling it on every segment allocation, up to the maximum size. Now, we
set current segment size to half of FLAGS_initial_log_segment_size_bytes and then double it
at the very first log segment allocation to achieve the same effect, but with cleaner
semantics for the user of the Log class.
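
A quick sketch of the size progression being described (the 64 MB cap below is just a placeholder for the maximum segment size, not the actual flag default):

def segment_sizes(initial_log_segment_size_bytes, max_segment_size_bytes, n):
    # New scheme: start at half of the configured initial size, then double on
    # every allocation, so the very first segment comes out at the initial size.
    size = initial_log_segment_size_bytes // 2
    result = []
    for _ in range(n):
        size = min(size * 2, max_segment_size_bytes)
        result.append(size)
    return result

>>> segment_sizes(1 << 20, 64 << 20, 8)   # 1MB initial segment, then doubling
[1048576, 2097152, 4194304, 8388608, 16777216, 33554432, 67108864, 67108864]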

Test Plan:
Filter the log from #5003.

  log-dump --filter_log_segment \
      ~/log-dump-test/input/wal-000000007 \
      --output_wal_dir ~/log-dump-test/output \
      --min_op_term_to_omit 8 --min_op_index_to_omit 2108 \
      --max_op_term_to_omit 8 --max_op_index_to_omit 2513

It should produce output similar to:

    I0710 18:24:57.233637 13345 log-dump.cc:249] Created directory ~/log-dump-test/output
    I0710 18:24:57.234838 13345 log-dump.cc:269] Source segment size ~/log-dump-test/input/wal-000000007: 67389360 bytes
    I0710 18:24:57.234859 13345 log-dump.cc:271] Target segment size: 74448896 bytes
    I0710 18:24:57.234988 13345 log-dump.cc:281] Will omit records between OpIds { term: 8 index: 2108 } and { term: 8 index: 2513 } (including the exact OpId matches).
    I0710 18:24:57.235208 13345 log.cc:819] T 91bf626f5794471ba61db14f07aaca8f P log-dump-tool: Setting log wal retention time to 0 seconds
    I0710 18:24:57.235327 13345 log.cc:429] interval_durable_wal_write_ms is turned on to sync every 1000 ms.
    I0710 18:24:58.092380 13345 log-dump.cc:319] Included 881 entries, omitted 406 entries
    I0710 18:24:58.330152 13345 log-dump.cc:326] Generated file ~/log-dump-test/output/index.000000000
    I0710 18:24:58.330181 13345 log-dump.cc:326] Generated file ~/log-dump-test/output/wal-000000007

Look at the size of the new file. It should be 45974166 or something close.

Then compare the output of log-dump on old and new files:

  diff <(./log-dump ~/log-dump-test/input/wal-000000007 | grep -v 'Tablet: 91bf626f5794471ba61db14f07aaca8f') \
       <(./log-dump ~/log-dump-test/output/wal-000000007 | grep -v 'Tablet: 91bf626f5794471ba61db14f07aaca8f')

The result should be similar to https://gist.githubusercontent.com/mbautin/5eccd75c5a4935ba6a760978fa76cf0b/raw

Reviewers: hector, kannan, rahuldesirazu

Reviewed By: rahuldesirazu

Subscribers: ybase, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D8876
mbautin added a commit that referenced this issue Jul 21, 2020
… incorrect entries on tablet bootstrap

Summary:
Fixing various bugs in tablet bootstrap.

#5003: WAL tail is not being overwritten correctly by a new leader
==================================================================

We can get into a situation when part of the WAL with indexes higher
than the last index flushed into RocksDB stays in pending_replicates by
the end of tablet bootstrap. Appending a future entry will fail in this
case, as happens in #5003. This bug happens because we don't process
WAL tail overwrites caused by entries whose index is less than or equal
to the last flushed index. The solution is to always process these
overwrites, even if the newly added entry's index has no exact match in
pending_replicates, and remove entries with higher indexes.
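
As a rough sketch of that rule (not the actual patch), the overwrite handling can be thought of as truncating pending_replicates from the first index at or above the incoming entry's index, before the flushed-index check rather than after it:

def handle_replicate_message_fixed(pending_replicates, flushed_index, term, index):
    # Process a possible WAL tail overwrite unconditionally: drop every pending
    # entry whose index is >= the incoming entry's index, exact match or not.
    for i in [i for i in pending_replicates if i >= index]:
        del pending_replicates[i]
    # Only then skip entries already covered by the flushed OpId.
    if index <= flushed_index:
        return
    pending_replicates[index] = (term, index)

With this rule, the 9.2108 NO_OP in the WAL above would clear 8.2400 .. 8.2513 out of pending_replicates even though its own index is below the flushed index.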

Treating entries that were never committed as committed
=======================================================

Part of tablet bootstrap logic that is supposed to only run on
committed entries could sometimes run on entries that were never
committed. This was possible because we still had an old way to recover
the committed OpId by using a Raft index-only match of the flushed OpId to
the entry being processed. This is a leftover from the time when we
were setting RocksDB sequence numbers to Raft indexes. E.g. suppose
entry 2.10 is flushed into RocksDB. When we encounter entry 1.10 during
bootstrap, we would consider it committed, even though it might have
been overwritten later by entry 2.5, because its index 10 matches that
of the flushed OpId. We probably would not apply such potentially
overwritten entries to RocksDB anyway as its index is too low and we
have more checks later in the process. However, this outdated logic is
not needed, because we have the "user frontier" / "consensus frontier"
mechanism to keep track of the last OpIds flushed into RocksDBs.

Not registering flushed operations with RetryableRequests
=========================================================

RetryableRequests is a subsystem introduced in
023c20a for ignoring duplicate write
requests to a tablet. It is important for consistency, and ensures that
retries at a higher layer do not cause multiple writes. In tablet
bootstrap, we need to register committed operations that happened
during the last FLAGS_retryable_request_timeout_secs seconds (120 seconds
by default) with RetryableRequests, whether or not those operations are
flushed. (Uncommitted operations are supposed to be passed to the
consensus layer as "orphaned replicates" and be registered with
RetryableRequests there.) Prior to this diff, operations with index
lower than the flushed index were not registered with RetryableRequests
for transactional tables.

#3759: potentially skipping too many segments with skip_wal_rewrite
===================================================================

We were starting replay from the last segment where the committed OpId
of the first entry was less than or equal to the flushed OpId.Suppose
segment 1 has entries 1.1 to 1.10 and nothing is committed, and segment
2 has entries 1.11 and 1.20, and only entry 1.2 is committed. Flushed
OpId is 1.5. Then we might look at segment 2, find that 1.2 (committed
OpId) <= 1.5 (flushed OpId), and incorrectly skip segment 1, even
though it contains entries we need. Instead of using committed OpId, we
need to be selecting the first segment to replay as the last segment
where the actual OpId of the first entry is less than or equal to the
flushed OpId.
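
The example above, spelled out as a tiny sketch (OpIds as (term, index) tuples; purely illustrative):

segments = [
    {"name": "segment-1", "first_op_id": (1, 1),  "first_committed_op_id": (0, 0)},
    {"name": "segment-2", "first_op_id": (1, 11), "first_committed_op_id": (1, 2)},
]
flushed_op_id = (1, 5)

def first_segment_to_replay(segments, flushed):
    # Corrected rule: the last segment whose first entry's actual OpId is
    # <= the flushed OpId.
    candidate = segments[0]
    for segment in segments:
        if segment["first_op_id"] <= flushed:
            candidate = segment
    return candidate

>>> first_segment_to_replay(segments, flushed_op_id)["name"]
'segment-1'

The old rule compared first_committed_op_id instead ((1, 2) <= (1, 5)), which would have started replay from segment-2 and skipped entries 1.6 through 1.10 that still need to be replayed.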

#4983: only create a new log segment after successful bootstrap
===============================================================

In the skip_wal_rewrite mode we are always creating a new log segment
before starting log replay. Then log replay picks up the new segment,
finds a "corruption" in it, and has to skip it. If a tablet server gets
into a restart loop due to tablet bootstrap failure, hundreds of new
empty log segments may get created quickly. It is much cleaner to only
create a new log segment after successful tablet bootstrap in the
skip_wal_rewrite mode (which is now the default mode). Also as part of
doing this, I simplified the ReadFirstEntryMetadata function by
delegating to the function ReadEntries with an additional parameter to
just read one entry.

Refactoring
===========

Reorganizing the tablet bootstrap code to fully hide the
TabletBootstrap class in the implementation file.

Other improvements
==================

Adding new test macros ASSERT_SETS_EQ and ASSERT_VECTORS_EQ.
ASSERT_VECTORS_EQ reports differences between element sets of two
vectors in case they do not match, in addition to regular output,
making it easier to spot differences in many cases.

Documenting intended usage for YB_STRUCT_TO_STRING and
YB_CLASS_TO_STRING macros.

Test Plan:
Jenkins

A new randomized tablet bootstrap test creating a log with multiple terms and overwriting of the WAL's tail.

Reviewers: amitanand, sergei, rahuldesirazu

Reviewed By: sergei

Subscribers: sergei, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D8917
mbautin added a commit that referenced this issue Jul 23, 2020
…leader and replay of incorrect entries on tablet bootstrap

Summary:
This is a backport of e4e2d39 (D8917).

Fixing various bugs in tablet bootstrap.

#5003: WAL tail is not being overwritten correctly by a new leader
==================================================================

We can get into a situation when part of the WAL with indexes higher
than the last index flushed into RocksDB stays in pending_replicates by
the end of tablet bootstrap. Appending a future entry will fail in this
case, as happens in #5003. This bug happens because we don't process
WAL tail overwrites caused by entries whose index is less than or equal
to the last flushed index. The solution is to always process these
overwrites, even if the newly added entry's index has no exact match in
pending_replicates, and remove entries with higher indexes.

Treating entries that were never committed as committed
=======================================================

Part of tablet bootstrap logic that is supposed to only run on
committed entries could sometimes run on entries that were never
committed. This was possible because we still had an old way to recover
the committed OpId by using a Raft index-only match of the flushed OpId to
the entry being processed. This is a leftover from the time when we
were setting RocksDB sequence numbers to Raft indexes. E.g. suppose
entry 2.10 is flushed into RocksDB. When we encounter entry 1.10 during
bootstrap, we would consider it committed, even though it might have
been overwritten later by entry 2.5, because its index 10 matches that
of the flushed OpId. We probably would not apply such potentially
overwritten entries to RocksDB anyway as its index is too low and we
have more checks later in the process. However, this outdated logic is
not needed, because we have the "user frontier" / "consensus frontier"
mechanism to keep track of the last OpIds flushed into RocksDBs.

Not registering flushed operations with RetryableRequests
=========================================================

RetryableRequests is a subsystem introduced in
023c20a for ignoring duplicate write
requests to a tablet. It is important for consistency, and ensures that
retries at a higher layer do not cause multiple writes. In tablet
bootstrap, we need to register committed operations that happened
during the last FLAGS_retryable_request_timeout_secs seconds (120 seconds
by default) with RetryableRequests, whether or not those operations are
flushed. (Uncommitted operations are supposed to be passed to the
consensus layer as "orphaned replicates" and be registered with
RetryableRequests there.) Prior to this diff, operations with index
lower than the flushed index were not registered with RetryableRequests
for transactional tables.

[#3759] Potentially skipping too many segments with skip_wal_rewrite
===================================================================

We were starting replay from the last segment where the committed OpId
of the first entry was less than or equal to the flushed OpId. Suppose
segment 1 has entries 1.1 to 1.10 and nothing is committed, and segment
2 has entries 1.11 to 1.20, and only entry 1.2 is committed. Flushed
OpId is 1.5. Then we might look at segment 2, find that 1.2 (committed
OpId) <= 1.5 (flushed OpId), and incorrectly skip segment 1, even
though it contains entries we need. Instead of using committed OpId, we
need to be selecting the first segment to replay as the last segment
where the actual OpId of the first entry is less than or equal to the
flushed OpId.

[#4983] Only create a new log segment after successful bootstrap
===============================================================

In the skip_wal_rewrite mode we are always creating a new log segment
before starting log replay. Then log replay picks up the new segment,
finds a "corruption" in it, and has to skip it. If a tablet server gets
into a restart loop due to tablet bootstrap failure, hundreds of new
empty log segments may get created quickly. It is much cleaner to only
create a new log segment after successful tablet bootstrap in the
skip_wal_rewrite mode (which is now the default mode). Also as part of
doing this, I simplified the ReadFirstEntryMetadata function by
delegating to the function ReadEntries with an additional parameter to
just read one entry.

Refactoring
===========

Reorganizing the tablet bootstrap code to fully hide the
TabletBootstrap class in the implementation file.

Other improvements
==================

Adding new test macros ASSERT_SETS_EQ and ASSERT_VECTORS_EQ.
ASSERT_VECTORS_EQ reports differences between element sets of two
vectors in case they do not match, in addition to regular output,
making it easier to spot differences in many cases.

Documenting intended usage for YB_STRUCT_TO_STRING and
YB_CLASS_TO_STRING macros.

Test Plan:
Jenkins: urgent, patch: 2.1.8, auto rebase: no

A new randomized tablet bootstrap test creating a log with multiple
terms and overwriting of the WAL's tail.

Reviewers: amitanand, rahuldesirazu, sergei

Reviewed By: sergei

Subscribers: bogdan, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D8990
mbautin added a commit that referenced this issue Jul 23, 2020
…leader and replay of incorrect entries on tablet bootstrap

Summary:
This is a backport of e4e2d39 (D8917).

Fixing various bugs in tablet bootstrap.

#5003: WAL tail is not being overwritten correctly by a new leader
==================================================================

We can get into a situation when part of the WAL with indexes higher
than the last index flushed into RocksDB stays in pending_replicates by
the end of tablet bootstrap. Appending a future entry will fail in this
case, as happens in #5003. This bug happens because we don't process
WAL tail overwrites caused by entries whose index is less than or equal
to the last flushed index. The solution is to always process these
overwrites, even if the newly added entry's index has no exact match in
pending_replicates, and remove entries with higher indexes.

Treating entries that were never committed as committed
=======================================================

Part of tablet bootstrap logic that is supposed to only run on
committed entries could sometimes run on entries that were never
committed. This was possible because we still had an old way to recover
the committed OpId by using a Raft index-only match of the flushed OpId to
the entry being processed. This is a leftover from the time when we
were setting RocksDB sequence numbers to Raft indexes. E.g. suppose
entry 2.10 is flushed into RocksDB. When we encounter entry 1.10 during
bootstrap, we would consider it committed, even though it might have
been overwritten later by entry 2.5, because its index 10 matches that
of the flushed OpId. We probably would not apply such potentially
overwritten entries to RocksDB anyway as its index is too low and we
have more checks later in the process. However, this outdated logic is
not needed, because we have the "user frontier" / "consensus frontier"
mechanism to keep track of the last OpIds flushed into RocksDBs.

Not registering flushed operations with RetryableRequests
=========================================================

RetryableRequests is a subsystem introduced in
023c20a for ignoring duplicate write
requests to a tablet. It is important for consistency, and ensures that
retries at a higher layer do not cause multiple writes. In tablet
bootstrap, we need to register committed operations that happened
during the last FLAGS_retryable_request_timeout_secs seconds (120 seconds
by default) with RetryableRequests, whether or not those operations are
flushed. (Uncommitted operations are supposed to be passed to the
consensus layer as "orphaned replicates" and be registered with
RetryableRequests there.) Prior to this diff, operations with index
lower than the flushed index were not registered with RetryableRequests
for transactional tables.

[#3759] Potentially skipping too many segments with skip_wal_rewrite
===================================================================

We were starting replay from the last segment where the committed OpId
of the first entry was less than or equal to the flushed OpId. Suppose
segment 1 has entries 1.1 to 1.10 and nothing is committed, and segment
2 has entries 1.11 to 1.20, and only entry 1.2 is committed. Flushed
OpId is 1.5. Then we might look at segment 2, find that 1.2 (committed
OpId) <= 1.5 (flushed OpId), and incorrectly skip segment 1, even
though it contains entries we need. Instead of using committed OpId, we
need to be selecting the first segment to replay as the last segment
where the actual OpId of the first entry is less than or equal to the
flushed OpId.

[#4983] Only create a new log segment after successful bootstrap
===============================================================

In the skip_wal_rewrite mode we are always creating a new log segment
before starting log replay. Then log replay picks up the new segment,
finds a "corruption" in it, and has to skip it. If a tablet server gets
into a restart loop due to tablet bootstrap failure, hundreds of new
empty log segments may get created quickly. It is much cleaner to only
create a new log segment after successful tablet bootstrap in the
skip_wal_rewrite mode (which is now the default mode). Also as part of
doing this, I simplified the ReadFirstEntryMetadata function by
delegating to the function ReadEntries with an additional parameter to
just read one entry.

Refactoring
===========

Reorganizing the tablet bootstrap code to fully hide the
TabletBootstrap class in the implementation file.

Other improvements
==================

Adding new test macros ASSERT_SETS_EQ and ASSERT_VECTORS_EQ.
ASSERT_VECTORS_EQ reports differences between element sets of two
vectors in case they do not match, in addition to regular output,
making it easier to spot differences in many cases.

Documenting intended usage for YB_STRUCT_TO_STRING and
YB_CLASS_TO_STRING macros.

Test Plan:
Jenkins: patch: 2.2.0

A new randomized tablet bootstrap test creating a log with multiple
terms and overwriting of the WAL's tail.

Reviewers: amitanand, rahuldesirazu, sergei

Reviewed By: sergei

Subscribers: ybase, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D8988
@mbautin mbautin closed this as completed Aug 11, 2020