-
Notifications
You must be signed in to change notification settings - Fork 6.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Auto recovery from out of space errors #4164
Conversation
Setting the WIP tag to account for the remaining testing. The implementation is largely complete, I think, pending feedback/comments. |
32e8acf
to
32e8767
Compare
@@ -27,6 +27,7 @@ enum class TableFileCreationReason { | |||
kFlush, | |||
kCompaction, | |||
kRecovery, | |||
kMisc, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be good to be more specific with the reason here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me rethink this. I was using TableFileCreationReason
to signal SstFileManagerImpl::OnAddFile
that the file was added due to compaction (with kMisc as sort of the default), but it doesn't necessarily mean the file was just created. It could be called during DB::Open
, when its just listing all the files. So maybe TableFileCreationReason
is not appropriate. Instead, I can have an explicit parameter in OnAddFile
to specify whether the file being added was created by compaction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still reviewing. This is more complicated than I though.
.travis.yml
Outdated
@@ -116,6 +121,7 @@ script: | |||
mkdir build && cd build && cmake -DJNI=1 .. -DCMAKE_BUILD_TYPE=Release && make -j4 rocksdb rocksdbjni | |||
;; | |||
esac | |||
- for i in $(find ./ -maxdepth 1 -name 'core*' -print); do gdb ./error_handler_test core* -ex "thread apply all bt" -ex "set pagination 0" -batch; done; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh wow, this is complicated unit test technique...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just temporary. I'm not going to merge this file. One of the tests is crashing only in Travis, so this is trying to print the backtrace.
db/error_handler.h
Outdated
private: | ||
bool IsRecoveryInProgress() { return recovery_in_prog_; } | ||
|
||
Status RecoverFromBGError(bool exclusive = false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain why we need exclusive and non-exclusive mode, and can't make everything exclusive?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's in fact the idea - to have only one kind of recovery going on (either auto or manual). I think the implementation is probably confusing. The idea is that a user that doesn't want auto recovery would turn it off in the OnErrorRecoveryBegin
event callback, and then call DB::Resume()
. The exclusive
parameter here is to indicate whether the call is from manual recovery or not. I should change the name to be more explicit.
@@ -844,6 +845,8 @@ class DBImpl : public DB { | |||
bool read_only = false, bool error_if_log_file_exist = false, | |||
bool error_if_data_exists_in_logs = false); | |||
|
|||
Status ResumeImpl(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you write a comment describing what the function is doing?
db/db_impl.cc
Outdated
"Shutdown: canceling all background work"); | ||
if (shutdown) { | ||
ROCKS_LOG_INFO(immutable_db_options_.info_log, | ||
"Shutdown: canceling all background work"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we set a flag here, will the race condition between recovery and shutdown be easier to resolve?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a interesting suggestion. We could set a new flag here, and in DBImpl::CloseHelper
, call error_handler_.CancelErrorRecovery
before CancelAllBackgroundWork
. That way, we would not have both trying to schedule memtable flushes. Let me look into this a bit more.
db/db_impl.cc
Outdated
} | ||
|
||
// Will lock the mutex_, will wait for completion if wait is true | ||
void DBImpl::CancelAllBackgroundWork(bool wait) { | ||
void DBImpl::CancelAllBackgroundWork(bool wait, bool shutdown) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is all what it does in shutdown=false
case these lines?
while (bg_bottom_compaction_scheduled_ || bg_compaction_scheduled_ ||
bg_flush_scheduled_) {
bg_cv_.Wait();
}
If that is the case, is it easier to make these lines a helper function, rather than fitting into CancelAllBackgroundWork()? Otherwise, it doesn't feel cancelling any background work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point. It is effectively just waiting for background work to be completed. This can be moved to a helper and reused.
util/sst_file_manager_impl.cc
Outdated
closing_(false), | ||
reserved_disk_buffer_(0), | ||
free_space_trigger_(0) { | ||
bg_thread_.reset(new port::Thread(&SstFileManagerImpl::ClearError, this)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way to start it only when out-of-space happened?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I think that's a good idea. That way, if an out of space error never happens, the behavior is exactly the same as it is now.
util/sst_file_manager_impl.cc
Outdated
Status s = env_->GetFreeSpace(path_, &free_space); | ||
if (s.ok()) { | ||
if (bg_err_.severity() == Status::Severity::kHardError) { | ||
// Give priority to hard errors |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you comment more about what it means?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its referring to the case where some DB instances experienced soft errors (i.e compaction failure) and some experienced hard errors (i.e flush/WAL write failures), we would estimate required free space as if all instances experienced hard errors. For hard errors, the free space threshold is reserved_disk_buffer_
, which is basically the sum of write_buffer_size of all instances. For soft errors, the threshold is free_space_trigger_
, which is the value of cur_compactions_reserved_size_
at the time of error.
util/sst_file_manager_impl.cc
Outdated
(needed_headroom + total_files_size_ > max_allowed_space_)) { | ||
cur_compactions_reserved_size_ -= size_added_by_compaction; | ||
return false; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a fairly complicated logic. I wonder whether we can simplify it, trading off some feature supports.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a couple of things I added here that increased the complexity - 1. The in_progress_files_size_ was added in order to account for a compaction that has generated some outputs that are already counted in cur_compactions_reserved_size_, and 2. Since compaction_buffer_size_ is a user provided parameter during SstFileManager
construction, choose a default if they didn't set it.
I can move these down below to the NoSpace
case in order to avoid impacting the normal case. Maybe #2 can also be removed, depending on how conservative we want to be when estimating headroom after getting out of space error.
util/sst_file_manager_impl.cc
Outdated
} else { | ||
// Take a snapshot of cur_compactions_reserved_size_ for when we encounter | ||
// a NoSpace error. | ||
free_space_trigger_ = cur_compactions_reserved_size_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a worry using this one.
For example, if a user runs universal compaction, and has 200 shards, and it is out-of-memory. Eventually, full compaction of all the shards will reserve the size equivalent to the whole shard size, and fail. The compaction trigger would be equal to the total size of the 200 shards. Then users need to drop the disk space to lower than 50% to make it recover. This is going to be much more than what is needed. Is it a valid worry?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cur_compactions_reserved_size_
is the space reserved by currently running compactions. In your example, would the user tune max_background_jobs
to limit the number of concurrent compactions? If they set it to a high number, they will always run the risk of doubling the space used. This check may, in fact benefit that case, as it does not stop background work but rather just slows it down. If EnoughRoomForCompaction
returns false, the compaction is retried after some time. So it would end up throttling concurrent compactions to some lower number that can actually be accommodated by the available disk space.
When an SST file manager manages multiple DBs, can we recover them slowly, rather than enable them all at one shot? I'm worry about the sudden space increase may complicate things. |
Summary: Search or jump to… Pull requests Issues Marketplace Explore @anand1976 Sign out 882 11,071 2,425 facebook/rocksdb Code Issues 124 Pull requests 88 Projects 1 Wiki Insights Auto recovery from out of space errors Summary: This commit implements automatic recovery from a Status::NoSpace() error during background operations such as write callback, flush and compaction. The broad design is as follows - 1. Compaction errors are treated as soft errors and don't put the database in read-only mode. A compaction is delayed until enough free disk space is available to accomodate the compaction outputs, which is estimated based on the input size. This means that users can continue to write, and we rely on the WriteController to delay or stop writes if the compaction debt becomes too high due to persistent low disk space condition 2. Errors during write callback and flush are treated as hard errors, i.e the database is put in read-only mode and goes back to read-write only fater certain recovery actions are taken. 3. Both types of recovery rely on the SstFileManagerImpl to poll for sufficient disk space. We assume that there is a 1-1 mapping between an SFM and the underlying OS storage container. For cases where multiple DBs are hosted on a single storage container, the user is expected to allocate a single SFM instance and use the same one for all the DBs. If no SFM is specified by the user, DBImpl::Open() will allocate one, but this will be one per DB and each DB will recover independently. The recovery implemented by SFM is as follows - a) On the first occurance of an out of space error during compaction, subsequent compactions will be delayed until the disk free space check indicates enough available space. The required space is computed as the sum of input sizes. b) The free space check requirement will be removed once the amount of free space is greater than the size reserved by in progress compactions when the first error occured c) If the out of space error is a hard error, a background thread in SFM will poll for sufficient headroom before triggering the recovery of the database and putting it in write-only mode. The headroom is calculated as the sum of the write_buffer_size of all the DB instances associated with the SFM 4. EventListener callbacks will be called at the start and completion of automatic recovery. Users can disable the auto recov ery in the start callback, and later initiate it manually by calling DB::Resume() Todo: 1. More extensive testing 2. Add disk full condition to db_stress (follow-on PR) Test Plan: Reviewers: Subscribers: Tasks: Tags:
Summary: 1. Fix corner cases where compaction state is left inconsistent a) Compaction triggered by recovery b) Compaction failure after writes are stopped (level0_stop_writes_trigger) - add it back to compaction queue 2. Prevent indefinite wait in DBImpl::DelayWrite() 3. Fix required free space calculation in SstFileManagerImpl Test Plan: Tested using db_bench on a loopback filesystem and periodically filling up the fs Reviewers: Subscribers: Tasks: Tags:
Summary: 1. Rebase and resolve conflicts 2. Work around GetFreeSpace conflict in Windows build 3. Misc. Travis compilation errors Test Plan: Reviewers: Subscribers: Tasks: Tags:
Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:
Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:
Summary: .travis.yml with a specific config and coredump/gdb enabled Test Plan: Reviewers: Subscribers: Tasks: Tags:
Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:
Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:
Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:
Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:
Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:
Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:
Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:
Summary: 1. Add tests for multiple column family and multiple DB cases 2. Change SstFileManager::OnAddFile signature 3. Better isolate DB instances in the multiple instance case by not impacting one instance if another has run into error, in order to preserve existing behavior Test Plan: Reviewers: Subscribers: Tasks: Tags:
Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:
Summary: The main conflict is with the commit 7daae51, which refactors flush request processing. Test Plan: Reviewers: Subscribers: Tasks: Tags:
0c05ac8
to
a59b8b2
Compare
db/db_impl.h
Outdated
// error recovery from going on in parallel. The latter, shutting_down_, | ||
// is set a little later during the shutdown after scheduling memtable | ||
// flushes | ||
bool shutdown_flag_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe call it shutdown_initiated_
directly?
db/db_impl.h
Outdated
// error recovery from going on in parallel. The latter, shutting_down_, | ||
// is set a little later during the shutdown after scheduling memtable | ||
// flushes | ||
bool shutdown_flag_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I'm picky but if possible can we place it together with those bool variables, for better padding?
util/sst_file_manager_impl.cc
Outdated
// NoSpace. If that happens, we abort the recovery here and start | ||
// over. | ||
mu_.Unlock(); | ||
s = cur_instance_->RecoverFromBGError(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible that, before the mutex is acquired again, the same DB recovered, but quickly failed again. Since this DB is still en error_handler_list_
, it is not inserted again, but we poped it up in line 294, so the error handle of this failure DB is out of track?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good catch. I think the safest way for now is to check the DB's error status again after acquiring the mutex, and leave it in the list if there is an error.
Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:
Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:
Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:
Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:
Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:
Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:
Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:
Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:
Summary: Temporarily disbling these tests in Travis as they are only failing in Travis and not reproducible elsewhere. Test Plan: Reviewers: Subscribers: Tasks: Tags:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
anand1976 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary: 1. Add override keyword to overridden virtual functions in EventListener 2. Fix a memory corruption that can happen during DB shutdown when in read-only mode due to a background write error 3. Fix uninitialized buffers in error_handler_test.cc that cause valgrind to complain Test Plan: make check make valgrind_test Reviewers: Subscribers: Tasks: Tags:
Summary: 1. Add override keyword to overridden virtual functions in EventListener 2. Fix a memory corruption that can happen during DB shutdown when in read-only mode due to a background write error 3. Fix uninitialized buffers in error_handler_test.cc that cause valgrind to complain Test Plan: make check make check (clang) make valgrind_test make ubsan_check Test Plan: Reviewers: Subscribers: Tasks: Tags:
Summary: 1. Add override keyword to overridden virtual functions in EventListener 2. Fix a memory corruption that can happen during DB shutdown when in read-only mode due to a background write error 3. Fix uninitialized buffers in error_handler_test.cc that cause valgrind to complain Pull Request resolved: #4375 Differential Revision: D9875779 Pulled By: anand1976 fbshipit-source-id: 022ede1edc01a9f7e21ecf4c61ef7d46545d0640
cv_.SignalAll(); | ||
} | ||
|
||
bool WaitForRecovery(uint64_t /*abs_time_us*/) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@anand1976 Hi! Seems the abs_time_us
here should be uncomment? Otherwise db_bench will wait forever on error.
Summary: A new interface method Env::GetFreeSpace was added in #4164. It needs to be implemented for Windows port. Some error_handler_test cases fail on Windows because recovery cannot succeed without free space being reported. Pull Request resolved: #6265 Differential Revision: D19303065 fbshipit-source-id: 1f1a83e53f334284781cf61feabc996e87b945ca
Hi @anand1976 / @siying, I had some questions around this change. Thanks in advance!
|
Summary: The OnAddFile cur_compactions_reserved_size_ accounting causes wraparound when re-opening a database with an unowned SstFileManager and during recovery. It was introduced in facebook#4164 which addresses out of space recovery with an unclear purpose. Compaction jobs do this accounting via EnoughRoomForCompaction/OnCompactionCompletion and to my understanding would never reuse a sst file name. Differential Revision: D62535775
Summary: Pull Request resolved: facebook#13010 The OnAddFile cur_compactions_reserved_size_ accounting causes wraparound when re-opening a database with an unowned SstFileManager and during recovery. It was introduced in facebook#4164 which addresses out of space recovery with an unclear purpose. Compaction jobs do this accounting via EnoughRoomForCompaction/OnCompactionCompletion and to my understanding would never reuse a sst file name. Reviewed By: anand1976 Differential Revision: D62535775
Summary: Pull Request resolved: #13010 The OnAddFile cur_compactions_reserved_size_ accounting causes wraparound when re-opening a database with an unowned SstFileManager and during recovery. It was introduced in #4164 which addresses out of space recovery with an unclear purpose. Compaction jobs do this accounting via EnoughRoomForCompaction/OnCompactionCompletion and to my understanding would never reuse a sst file name. Reviewed By: anand1976 Differential Revision: D62535775 fbshipit-source-id: a7c44d6e0a4b5ff74bc47abfe57c32ca6770243d
Summary:
This commit implements automatic recovery from a Status::NoSpace() error
during background operations such as write callback, flush and
compaction. The broad design is as follows -
database in read-only mode. A compaction is delayed until enough free
disk space is available to accomodate the compaction outputs, which is
estimated based on the input size. This means that users can continue to
write, and we rely on the WriteController to delay or stop writes if the
compaction debt becomes too high due to persistent low disk space
condition
i.e the database is put in read-only mode and goes back to read-write
only fater certain recovery actions are taken.
sufficient disk space. We assume that there is a 1-1 mapping between an
SFM and the underlying OS storage container. For cases where multiple
DBs are hosted on a single storage container, the user is expected to
allocate a single SFM instance and use the same one for all the DBs. If
no SFM is specified by the user, DBImpl::Open() will allocate one, but
this will be one per DB and each DB will recover independently. The
recovery implemented by SFM is as follows -
a) On the first occurance of an out of space error during compaction,
subsequent
compactions will be delayed until the disk free space check indicates
enough available space. The required space is computed as the sum of
input sizes.
b) The free space check requirement will be removed once the amount of
free space is greater than the size reserved by in progress
compactions when the first error occured
c) If the out of space error is a hard error, a background thread in
SFM will poll for sufficient headroom before triggering the recovery
of the database and putting it in write-only mode. The headroom is
calculated as the sum of the write_buffer_size of all the DB instances
associated with the SFM
automatic recovery. Users can disable the auto recov ery in the start
callback, and later initiate it manually by calling DB::Resume()
Todo:
Test Plan:
Reviewers:
Subscribers:
Tasks:
Tags: