Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[docdb] During bootstrap skip flushed entries could skip log segment that contain committed operation #3759

Closed
spolitov opened this issue Feb 27, 2020 · 1 comment
Assignees
Labels
area/docdb YugabyteDB core features kind/bug This issue is a bug priority/high High Priority

Comments

@spolitov
Copy link
Contributor

spolitov commented Feb 27, 2020

The could result in the following crash loop:

[yugabyte@ip-192-168-238-53 logs]$ cat yb-tserver.FATAL.details.2020-02-27T08_38_02.pid100248.txt
F20200227 08:38:02 ../../src/yb/consensus/replica_state.cc:791] Check failed: pending_operations_.front()->id().index() == last_committed_op_id_.index + 1 (108780008 vs. 108776042) 
    @     0x7f1024ae1fd5  yb::LogFatalHandlerSink::send()
    @     0x7f10242d9466  google::LogMessage::SendToLog()
    @     0x7f10242d683a  google::LogMessage::Flush()
    @     0x7f10242d9a89  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f1026650a98  yb::consensus::ReplicaState::AdvanceCommittedOpIdUnlocked()
    @     0x7f102665107a  yb::consensus::ReplicaState::UpdateMajorityReplicatedUnlocked()
    @     0x7f102662aea6  yb::consensus::RaftConsensus::UpdateMajorityReplicated()
    @     0x7f102660ce4e  yb::consensus::PeerMessageQueue::NotifyObserversOfMajorityReplOpChangeTask()
    @     0x7f1024b70ea4  yb::ThreadPool::DispatchThread()
    @     0x7f1024b6d6b3  yb::Thread::SuperviseThread()
    @     0x7f1022c3afc4  start_thread
    @     0x7f1022b399bd  __clone
    @              (nil)  (unknown)
@spolitov spolitov added the area/docdb YugabyteDB core features label Feb 27, 2020
@rao-vasireddy rao-vasireddy added kind/bug This issue is a bug priority/high High Priority labels Feb 27, 2020
@bmatican
Copy link
Contributor

This looks like the same thing we hit AFTER the initial crash in #3413 (comment) cc @rahuldesirazu

@bmatican bmatican assigned rahuldesirazu and unassigned bmatican Feb 28, 2020
mbautin added a commit that referenced this issue Jul 21, 2020
… incorrect entries on tablet bootstrap

Summary:
Fixing various bugs in tablet bootstrap.

#5003: WAL tail is not being overwritten correctly by a new leader
==================================================================

We can get into a situation when part of the WAL with indexes higher
than the last index flushed into RocksDB stays in pending_replicates by
the end of tablet bootstrap. Appending a future entry will fail in this
case, as happens in #5003. This bug happens because we don't process
WAL tail overwrites caused by entries whose index is less than or equal
to the last flushed index. The solution is to always process these
overwrites, even if the newly added entry's index has no exact match in
pending_replicates, and remove entries with higher indexes.

Treating entries that were never committed as committed
=======================================================

Ppart of tablet bootstrap logic that is supposed to only run on
committed entries could sometimes run on entries that were never
committed. This was possible because we still had an old way to recover
the comitted OpId from using a Raft index-only match of flushed OpId to
the entry being processed. This is a leftover from the time when we
were setting RocksDB sequence numbers to Raft indexes. E.g. suppose
entry 2.10 is flushed into RocksDB. When we encounter entry 1.10 during
bootstrap, we would consider it committed, even though it might have
been overwritten later by entry 2.5, because its index 10 matches that
of the flushed OpId. We probably would not apply such potentially
overwritten entries to RocksDB anyway as its index is too low and we
have more checks later in the process. However, this outdated logic is
not needed, because we have the "user frontier" / "consensus frontier"
mechanism to keep track of last OpIds flushed in to RocksDBs.

Not registering flushed operations with RetryableRequests
=========================================================

RetryableRequests is a subsystem introduced in
023c20a for ignoring duplicate write
requests to a tablet. It is important for consistency, and ensures that
retries at a higher layer do not cause multiple writes. In tablet
bootstrap, we need to register committed operations that happened
during last FLAGS_retryable_request_timeout_secs seconds (120 seconds
by default) with RetryableRequests, whether or not those operations are
flushed. (Uncommitted operations are supposed to be passed to the
consensus layer as "orphaned replicates" and be registered with
RetryableRequests there.) Prior to this diff, operations with index
lower than the flushed index were not register with RetryableRequests
for transactional tables.

#3759: potentially skipping too many segments with skip_wal_rewrite
===================================================================

We were starting replay from the last segment where the committed OpId
of the first entry was less than or equal to the flushed OpId.Suppose
segment 1 has entries 1.1 to 1.10 and nothing is committed, and segment
2 has entries 1.11 and 1.20, and only entry 1.2 is committed. Flushed
OpId is 1.5. Then we might look at segment 2, find that 1.2 (committed
OpId) <= 1.5 (flushed OpId), and incorrectly skip segment 1, even
though it contains entries we need. Instead of using committed OpId, we
need to be selecting the first segment to replay as the last segment
where the actual OpId of the first entry is less than or equal to the
flushed OpId.

#4983: only create a new log segment after successful bootstrap
===============================================================

In the skip_wal_rewrite mode we are always creating a new log segment
before starting log replay. Then log replay picks up the new segment,
finds a "corruption" in it, and has to skip it. If a tablet server gets
into a restart loop due to tablet bootstrap failure, hundreds of new
empty log segments may get created quickly. It is much cleaner to only
create a new log segment after successful tablet bootstrap in the
skip_wal_rewrite mode (which is now the default mode). Also as part of
doing this, I simplified the ReadFirstEntryMetadata function by
delegating to the function ReadEntries with an additional parameter to
just read one entry.

Refactoring
===========

Reorganizing the tablet bootstrap code to fully hide the
TabletBootstrap class in the implementation file.

Other improvements
==================

Adding new test macros ASSERT_SETS_EQ and ASSERT_VECTORS_EQ.
ASSERT_VECTORS_EQ reports differences between element sets of two
vectors in case they do not match, in addition to regular output,
making it easier to spot differences in many cases.

Documenting intended usage for YB_STRUCT_TO_STRING and
YB_CLASS_TO_STRING macros.

Test Plan:
Jenkins

A new randomized tablet bootstrap test creating a log with multiple terms and overwriting of the WAL's tail.

Reviewers: amitanand, sergei, rahuldesirazu

Reviewed By: sergei

Subscribers: sergei, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D8917
mbautin added a commit that referenced this issue Jul 23, 2020
…leader and replay of incorrect entries on tablet bootstrap

Summary:
This is a backport of e4e2d39 (D8917).

Fixing various bugs in tablet bootstrap.

#5003: WAL tail is not being overwritten correctly by a new leader
==================================================================

We can get into a situation when part of the WAL with indexes higher
than the last index flushed into RocksDB stays in pending_replicates by
the end of tablet bootstrap. Appending a future entry will fail in this
case, as happens in #5003. This bug happens because we don't process
WAL tail overwrites caused by entries whose index is less than or equal
to the last flushed index. The solution is to always process these
overwrites, even if the newly added entry's index has no exact match in
pending_replicates, and remove entries with higher indexes.

Treating entries that were never committed as committed
=======================================================

Part of tablet bootstrap logic that is supposed to only run on
committed entries could sometimes run on entries that were never
committed. This was possible because we still had an old way to recover
the comitted OpId from using a Raft index-only match of flushed OpId to
the entry being processed. This is a leftover from the time when we
were setting RocksDB sequence numbers to Raft indexes. E.g. suppose
entry 2.10 is flushed into RocksDB. When we encounter entry 1.10 during
bootstrap, we would consider it committed, even though it might have
been overwritten later by entry 2.5, because its index 10 matches that
of the flushed OpId. We probably would not apply such potentially
overwritten entries to RocksDB anyway as its index is too low and we
have more checks later in the process. However, this outdated logic is
not needed, because we have the "user frontier" / "consensus frontier"
mechanism to keep track of last OpIds flushed in to RocksDBs.

Not registering flushed operations with RetryableRequests
=========================================================

RetryableRequests is a subsystem introduced in
023c20a for ignoring duplicate write
requests to a tablet. It is important for consistency, and ensures that
retries at a higher layer do not cause multiple writes. In tablet
bootstrap, we need to register committed operations that happened
during last FLAGS_retryable_request_timeout_secs seconds (120 seconds
by default) with RetryableRequests, whether or not those operations are
flushed. (Uncommitted operations are supposed to be passed to the
consensus layer as "orphaned replicates" and be registered with
RetryableRequests there.) Prior to this diff, operations with index
lower than the flushed index were not register with RetryableRequests
for transactional tables.

[#3759] Potentially skipping too many segments with skip_wal_rewrite
===================================================================

We were starting replay from the last segment where the committed OpId
of the first entry was less than or equal to the flushed OpId.Suppose
segment 1 has entries 1.1 to 1.10 and nothing is committed, and segment
2 has entries 1.11 and 1.20, and only entry 1.2 is committed. Flushed
OpId is 1.5. Then we might look at segment 2, find that 1.2 (committed
OpId) <= 1.5 (flushed OpId), and incorrectly skip segment 1, even
though it contains entries we need. Instead of using committed OpId, we
need to be selecting the first segment to replay as the last segment
where the actual OpId of the first entry is less than or equal to the
flushed OpId.

[#4983] Only create a new log segment after successful bootstrap
===============================================================

In the skip_wal_rewrite mode we are always creating a new log segment
before starting log replay. Then log replay picks up the new segment,
finds a "corruption" in it, and has to skip it. If a tablet server gets
into a restart loop due to tablet bootstrap failure, hundreds of new
empty log segments may get created quickly. It is much cleaner to only
create a new log segment after successful tablet bootstrap in the
skip_wal_rewrite mode (which is now the default mode). Also as part of
doing this, I simplified the ReadFirstEntryMetadata function by
delegating to the function ReadEntries with an additional parameter to
just read one entry.

Refactoring
===========

Reorganizing the tablet bootstrap code to fully hide the
TabletBootstrap class in the implementation file.

Other improvements
==================

Adding new test macros ASSERT_SETS_EQ and ASSERT_VECTORS_EQ.
ASSERT_VECTORS_EQ reports differences between element sets of two
vectors in case they do not match, in addition to regular output,
making it easier to spot differences in many cases.

Documenting intended usage for YB_STRUCT_TO_STRING and
YB_CLASS_TO_STRING macros.

Test Plan:
Jenkins: urgent, patch: 2.1.8, auto rebase: no

A new randomized tablet bootstrap test creating a log with multiple
terms and overwriting of the WAL's tail.

Reviewers: amitanand, rahuldesirazu, sergei

Reviewed By: sergei

Subscribers: bogdan, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D8990
mbautin added a commit that referenced this issue Jul 23, 2020
…leader and replay of incorrect entries on tablet bootstrap

Summary:
This is a backport of e4e2d39 (D8917).

Fixing various bugs in tablet bootstrap.

#5003: WAL tail is not being overwritten correctly by a new leader
==================================================================

We can get into a situation when part of the WAL with indexes higher
than the last index flushed into RocksDB stays in pending_replicates by
the end of tablet bootstrap. Appending a future entry will fail in this
case, as happens in #5003. This bug happens because we don't process
WAL tail overwrites caused by entries whose index is less than or equal
to the last flushed index. The solution is to always process these
overwrites, even if the newly added entry's index has no exact match in
pending_replicates, and remove entries with higher indexes.

Treating entries that were never committed as committed
=======================================================

Part of tablet bootstrap logic that is supposed to only run on
committed entries could sometimes run on entries that were never
committed. This was possible because we still had an old way to recover
the comitted OpId from using a Raft index-only match of flushed OpId to
the entry being processed. This is a leftover from the time when we
were setting RocksDB sequence numbers to Raft indexes. E.g. suppose
entry 2.10 is flushed into RocksDB. When we encounter entry 1.10 during
bootstrap, we would consider it committed, even though it might have
been overwritten later by entry 2.5, because its index 10 matches that
of the flushed OpId. We probably would not apply such potentially
overwritten entries to RocksDB anyway as its index is too low and we
have more checks later in the process. However, this outdated logic is
not needed, because we have the "user frontier" / "consensus frontier"
mechanism to keep track of last OpIds flushed in to RocksDBs.

Not registering flushed operations with RetryableRequests
=========================================================

RetryableRequests is a subsystem introduced in
023c20a for ignoring duplicate write
requests to a tablet. It is important for consistency, and ensures that
retries at a higher layer do not cause multiple writes. In tablet
bootstrap, we need to register committed operations that happened
during last FLAGS_retryable_request_timeout_secs seconds (120 seconds
by default) with RetryableRequests, whether or not those operations are
flushed. (Uncommitted operations are supposed to be passed to the
consensus layer as "orphaned replicates" and be registered with
RetryableRequests there.) Prior to this diff, operations with index
lower than the flushed index were not register with RetryableRequests
for transactional tables.

[#3759] Potentially skipping too many segments with skip_wal_rewrite
===================================================================

We were starting replay from the last segment where the committed OpId
of the first entry was less than or equal to the flushed OpId.Suppose
segment 1 has entries 1.1 to 1.10 and nothing is committed, and segment
2 has entries 1.11 and 1.20, and only entry 1.2 is committed. Flushed
OpId is 1.5. Then we might look at segment 2, find that 1.2 (committed
OpId) <= 1.5 (flushed OpId), and incorrectly skip segment 1, even
though it contains entries we need. Instead of using committed OpId, we
need to be selecting the first segment to replay as the last segment
where the actual OpId of the first entry is less than or equal to the
flushed OpId.

[#4983] Only create a new log segment after successful bootstrap
===============================================================

In the skip_wal_rewrite mode we are always creating a new log segment
before starting log replay. Then log replay picks up the new segment,
finds a "corruption" in it, and has to skip it. If a tablet server gets
into a restart loop due to tablet bootstrap failure, hundreds of new
empty log segments may get created quickly. It is much cleaner to only
create a new log segment after successful tablet bootstrap in the
skip_wal_rewrite mode (which is now the default mode). Also as part of
doing this, I simplified the ReadFirstEntryMetadata function by
delegating to the function ReadEntries with an additional parameter to
just read one entry.

Refactoring
===========

Reorganizing the tablet bootstrap code to fully hide the
TabletBootstrap class in the implementation file.

Other improvements
==================

Adding new test macros ASSERT_SETS_EQ and ASSERT_VECTORS_EQ.
ASSERT_VECTORS_EQ reports differences between element sets of two
vectors in case they do not match, in addition to regular output,
making it easier to spot differences in many cases.

Documenting intended usage for YB_STRUCT_TO_STRING and
YB_CLASS_TO_STRING macros.

Test Plan:
Jenkins: patch: 2.2.0

A new randomized tablet bootstrap test creating a log with multiple
terms and overwriting of the WAL's tail.

Reviewers: amitanand, rahuldesirazu, sergei

Reviewed By: sergei

Subscribers: ybase, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D8988
@bmatican bmatican changed the title During bootstrap skip flushed entries could skip log segment that contain committed operation [docdb] During bootstrap skip flushed entries could skip log segment that contain committed operation Oct 10, 2020
@bmatican bmatican assigned mbautin and unassigned rahuldesirazu Oct 10, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/docdb YugabyteDB core features kind/bug This issue is a bug priority/high High Priority
Projects
None yet
Development

No branches or pull requests

5 participants