
[RFC] Cleanup unreferenced files in case segment merge fails #8024

Closed
RS146BIJAY opened this issue Jun 12, 2023 · 10 comments
Labels
enhancement: Enhancement or improvement to existing feature or request
RFC: Issues requesting major changes
Storage: Issues and PRs relating to data and metadata storage

Comments

@RS146BIJAY
Contributor

RS146BIJAY commented Jun 12, 2023

Overview

OpenSearch periodically merges multiple smaller segments into larger segments to keep the index size at bay and to expunge deletes. An entire merge operation requires at least 3x the space of the segments being merged (up to a maximum of 3x the shard size in the case of a force merge with max segments = 1). If enough space is not available for the segment merge to go through, the merge can fail at any intermediate step. Unreferenced files created during a segment merge failure can take up a lot of space, leading to a full data volume and potential node drops.
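As a rough illustration of the 3x figure: a merge over segments that together occupy 1 GB needs on the order of 3 GB of free disk space to run to completion, and with a force merge down to a single segment that bound grows to roughly 3x the whole shard size.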

Unreferenced Files

In Lucene, unreferenced files are shard files that are no longer live or used by the shard: no Lucene commit (segments_N file) references them, nor are they actively updated by the IndexWriter. When segment merges fail because the disk is full, many unreferenced files can be generated, and they continue to occupy a lot of space on the data volume. Lucene intentionally does not delete these unreferenced files, because it marks a disk-full failure as a tragedy. Lucene avoids wasting unnecessary IO/CPU cycles this way (if it cleaned up these files, the same segment merge would be retried, filling up the space and clearing unreferenced files again in a loop, wasting CPU/IO cycles).

Proposed Solutions

This document analyses multiple proposals for cleaning up these unreferenced files when a segment merge fails.

Approach 1 (Inside Lucene)

One approach is for Lucene itself to handle the cleanup of unreferenced files when a segment merge fails because the disk is full. As discussed above, Lucene considers such a failure a tragedy: it skips deleting the unreferenced files and closes the IndexWriter, and later OpenSearch closes the corresponding shard.

Since the disk getting 100% full does not necessarily mean the system cannot recover (unlike, say, a VirtualMachineError), an IOException caused by a full disk should not be treated as tragic. Instead, in such a scenario, Lucene should delete the unreferenced files generated during the failed segment merge.

Once cleanup completes, Lucene can close the IndexWriter instance as usual to avoid any parallel operation (such as parallel writes, relocation, etc.) changing the shard state during the cleanup. After this, OpenSearch will close the InternalEngine and mark the shard as unassigned, which triggers a local shard recovery; the recovery reopens the InternalEngine and IndexWriter and brings the shard state back to STARTED.

To avoid the above operations getting into a loop (segment merge → clean up → segment merge), we can further change the Lucene merge policy to not allow segment merges when enough space is not available. This will prevent wasted IO/CPU cycles.

Discussion Thread

apache/lucene#12228

Pros

  1. The biggest benefit of this approach is that Lucene continues to manage segment files; OpenSearch does not have to deal with the unreferenced files Lucene generates.
  2. It is a cleaner approach, as OpenSearch does not need to wait for Lucene operations to complete while the unreferenced files are cleaned up.

Cons

  1. Lucene does not have the concept of blocks. If we make this change inside Lucene, there can be a scenario where we keep allowing writes even though segment merges are no longer happening. Segment counts will then grow and we can start hitting open file descriptor limits, causing performance issues.

Approach 2 (Inside OpenSearch)

Another approach is to handle the unreferenced file cleanup within OpenSearch itself. In this approach, OpenSearch performs the cleanup once the shard is closed and marked as unassigned, at the end of the failEngine function inside InternalEngine. Before cleaning the unreferenced files, we validate that the engine failed because the data volume filled up during a segment merge; if so, we perform the cleanup.

Since only Lucene has information about which files are unreferenced (it maintains this inside IndexFileDeleter), OpenSearch will remove the unreferenced files by creating an IndexWriter instance inside a try-with-resources block (a pattern already used elsewhere in OpenSearch). Internally, this creates a new IndexFileDeleter, which removes the files not referenced by any commit.
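A minimal sketch of what that cleanup step could look like, assuming plain Lucene APIs; the class and method names and the config choices below are illustrative, not the actual OpenSearch implementation:

```java
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;

import java.io.IOException;

final class UnreferencedFileCleanup {
    // Opens (and immediately closes) an IndexWriter over the shard directory so that
    // Lucene's IndexFileDeleter drops files not referenced by the latest segments_N.
    static void cleanUpUnreferencedFiles(Directory shardDirectory) throws IOException {
        IndexWriterConfig config = new IndexWriterConfig()
            // Open the existing index rather than creating a new one.
            .setOpenMode(IndexWriterConfig.OpenMode.APPEND)
            // We only want the deletion side effect, not a new commit on close.
            .setCommitOnClose(false);
        try (IndexWriter cleanupWriter = new IndexWriter(shardDirectory, config)) {
            // No-op: the cleanup happens as a side effect of constructing the writer,
            // and closing it releases the write lock on the shard directory.
        }
    }
}
```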

Once the shard is closed and cleanup is completed, OpenSearch marks the shard as unassigned. Since this is a cluster state change (shard state changed to unassigned) and there is already a valid copy of the shard on the node, OpenSearch will trigger a local recovery for this shard. Once recovery completes and the translog is replayed on the shard, the shard state is marked as STARTED.

To prevent the cleanup operation and segment merges from getting into a loop (segment merge → clean up → segment merge), we can further change the OpenSearch merge policy to not allow segment merges when enough space is not available. This will prevent wasted IO/CPU cycles. A sketch of such a policy guard follows.
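For illustration, a rough sketch of such a merge-policy guard as a wrapper around the existing policy; the class name, the free-space threshold, and how it would be wired into the engine configuration are all assumptions (forced merges would need the same treatment via findForcedMerges):

```java
import org.apache.lucene.index.FilterMergePolicy;
import org.apache.lucene.index.MergePolicy;
import org.apache.lucene.index.MergeTrigger;
import org.apache.lucene.index.SegmentInfos;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class DiskAwareMergePolicy extends FilterMergePolicy {
    private final Path dataPath;
    private final long minFreeBytes; // e.g. roughly 3x the expected merge size

    public DiskAwareMergePolicy(MergePolicy in, Path dataPath, long minFreeBytes) {
        super(in);
        this.dataPath = dataPath;
        this.minFreeBytes = minFreeBytes;
    }

    @Override
    public MergeSpecification findMerges(MergeTrigger mergeTrigger, SegmentInfos segmentInfos,
                                         MergeContext mergeContext) throws IOException {
        long usableBytes = Files.getFileStore(dataPath).getUsableSpace();
        if (usableBytes < minFreeBytes) {
            // Not enough head room: skip scheduling merges instead of failing mid-merge.
            return null;
        }
        return super.findMerges(mergeTrigger, segmentInfos, mergeContext);
    }
}
```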

Pros

  1. It is a cleaner approach, as we do not need to explicitly handle multithreading scenarios while cleaning up unreferenced files.
  2. Since the cleanup is performed after the shard has failed, no parallel operation on that shard will be ongoing while the unreferenced files are cleaned up.
  3. We also do not need to handle operations like reopening the IndexWriter and IndexReader; local shard recovery handles all of this after the cleanup completes.

Cons

  1. We need to fail the shard in this approach.

How Can You Help?

  1. Provide early feedback on any issues you see with the above approaches.

Next Steps

We will incorporate feedback and continue with more concrete prototypes.

@gbbafna
Collaborator

gbbafna commented Jun 13, 2023

Thanks @RS146BIJAY for this. Option 2 makes sense as Option 1 doesn't look like Lucene's problem to solve from the issue you opened.

Since this is a cluster state change (shard state changed to unassigned) and there is already a valid copy of the shard on the node, OpenSearch will trigger a local recovery for this shard.

For remote store enabled indices, we will always trigger recovery from remote. However, since all the data is written durably in the translog and segment store, we should be good here. But we should verify this behavior for both remote-enabled and remote-disabled indices.

@RS146BIJAY
Contributor Author

Thanks @gbbafna for the feedback. Will verify this behaviour while working on the POC.

@RS146BIJAY
Contributor Author

RS146BIJAY commented Jun 14, 2023

Would appreciate feedback from @shwetathareja, @nknize, @Bukhtawar, @sachinpkale, @muralikpbhat, and @reta as well, so tagging more folks for visibility. Thanks!

@msfroh
Collaborator

msfroh commented Jun 14, 2023

If the disk fills up, then pending writes for all shards writing to the given volume will fail. (Well, at least flushing those pending writes will fail.)

Can we just close and reopen the IndexWriter for all shards (on the given volume)? That should clean up the unused files.

@RS146BIJAY
Contributor Author

Actually, we did explore that solution, but it seems we would need to handle a lot of edge cases and the approach became quite complicated. Following is the approach along with its pros and cons:

Fail just the IndexWriter

Another approach to unreferenced file cleanup is to allow only the IndexWriter to close, without failing the entire shard (i.e. without failing the InternalEngine). Here we perform the cleanup after the segment merge fails, in a callback function (handleMergeException) provided by EngineMergeScheduler. After Lucene closes the IndexWriter and releases the lock on the shard directory, the handleMergeException callback of EngineMergeScheduler is called.

In this function we check whether the merge failure was caused by no space being left on the device. If it was, we skip failing the engine and create a new IndexWriter instance over the same shard directory to clean up the unreferenced files. Once cleanup is done, the next step is to reopen the IndexWriter that Lucene closed because of the full disk; for this, we swap the current IndexWriter with the new IndexWriter instance created for the cleanup.

There are a few edge cases to consider with this approach. Since the handleMergeException callback is invoked after every merge failure of a shard, it will be called multiple times if multiple segment merges for that shard fail. In such a scenario, we must ensure that OpenSearch creates only a single IndexWriter instance for the cleanup and that each thread does not end up creating its own. This can be done with a new ReentrantLock and a global flag that is set only once, after the cleanup completes; a sketch of this guard follows.
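A sketch of that single-cleanup guard (class, field, and method names are illustrative, not actual OpenSearch code):

```java
import java.util.concurrent.locks.ReentrantLock;

class MergeFailureCleanupGuard {
    private final ReentrantLock cleanupLock = new ReentrantLock();
    // Global flag, set exactly once after the first successful cleanup.
    private volatile boolean cleanupDone = false;

    /** Called from handleMergeException; may be hit by several merge threads at once. */
    void maybeCleanup(Runnable cleanupAction) {
        if (cleanupDone) {
            return; // fast path: another merge thread already performed the cleanup
        }
        cleanupLock.lock();
        try {
            if (cleanupDone) {
                return; // re-check under the lock
            }
            cleanupAction.run(); // e.g. open the cleanup IndexWriter and swap it in
            cleanupDone = true;
        } finally {
            cleanupLock.unlock();
        }
    }
}
```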

Besides this, OpenSearch periodically refreshes the reader to provide "near real-time" search, where writes made during an IndexWriter session are quickly made available for searching without closing the writer or calling commit. OpenSearch does this by closing the previous Reader instance and swapping in a new one. Since we are closing the previous IndexWriter instance after the segment merge fails, refreshing a Reader created from that writer will throw an AlreadyClosedException. So after we create a new IndexWriter and swap it with the old instance, we also need to create new IndexReaderManager instances from the new IndexWriter and swap them with the older instances inside InternalEngine.
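To make the failure mode concrete, here is a small plain-Lucene illustration of the refresh path and why it fails once the writer is closed; this is not OpenSearch's actual reader-manager wiring, just the underlying APIs:

```java
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.AlreadyClosedException;

import java.io.IOException;

final class NrtRefreshExample {
    static void refreshAfterWriterClose(IndexWriter writer) throws IOException {
        // Near-real-time reader opened from the writer; a refresh opens a new reader,
        // swaps it in, and closes the old one.
        DirectoryReader reader = DirectoryReader.open(writer);
        DirectoryReader refreshed = DirectoryReader.openIfChanged(reader);
        if (refreshed != null) {
            reader.close();
            reader = refreshed;
        }

        writer.close(); // what effectively happens on a tragic merge failure

        try {
            DirectoryReader.openIfChanged(reader); // refresh while the writer is closed
        } catch (AlreadyClosedException e) {
            // This is the exception the refresh path keeps hitting until a new
            // writer (and reader manager) is swapped in.
        }
        reader.close();
    }
}
```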

Also, in the window between the IndexWriter being closed and it being reopened, reader refresh must not be allowed, because a refresh during this period will throw an AlreadyClosedException. Since the IndexWriter is closed but the Engine is still open at this point, OpenSearch considers it an invalid state and re-throws an AssertionError, which propagates to the transport layer and kills the OpenSearch process. To handle this, we can reuse the global flag to disable reader refresh until both the IndexReader and IndexWriter are reopened.

Pros

  1. We do not need to fail the shard, so the shard does not go into an unassigned state in this approach.

Cons

  1. This is not a clean approach, as we need to handle multiple edge cases, such as preventing reader refresh while cleanup is ongoing.
  2. The handleMergeException callback provided by EngineMergeScheduler is called whenever any segment merge fails. If multiple merges are running at the same time for the same shard (but on different segments), we need to handle multithreading scenarios, e.g. ensuring the cleanup happens from only one thread.
  3. Since the shard has not failed yet, another parallel operation, such as relocation, may get triggered while the unreferenced file cleanup is ongoing. This may corrupt the shard.

@reta
Collaborator

reta commented Jun 15, 2023

@RS146BIJAY after reading the Apache Lucene issue(s) threads, I would agree with @gbbafna that the second option (failing the shard + cleanup inside OpenSearch) is (to me) the way to move forward, with the caveat:

To prevent the cleanup operation and segment merges from getting into a loop (segment merge → clean up → segment merge), we can further change the OpenSearch merge policy to not allow segment merges when enough space is not available.

As was highlighted there (apache/lucene#12228 (comment)), we should not ignore that but do a hard failure (keep the shard UNASSIGNED? move the index to read-only?).

@shwetathareja
Member

Thanks @RS146BIJAY for the proposal and for evaluating different approaches. Option 2 makes sense, considering Option 1 is not feasible and simply reopening the IndexWriter has multiple edge cases.

To @reta 's point

As was highlighted there (apache/lucene#12228 (comment)), we should not ignore that but do a hard failure (keep the shard UNASSIGNED? move the index to read-only?).

+1. If the disk is indeed full, then we shouldn't continue to take writes when segment merges are failing or are stopped by the merge policy in the background. But InternalEngine might not be the best place to keep the shard unassigned or read-only and take this decision in isolation. I am wondering whether DiskThresholdMonitor should start collecting information about MergePolicy failures and also consider them when evaluating watermarks.

Also, this error needs to surface to the user so they know merges have been stopped because not enough space is available for them.

@RS146BIJAY
Contributor Author

RS146BIJAY commented Jun 16, 2023

Thanks @reta and @shwetathareja for the feedback. Just one query: why do we think we need a hard failure after cleanup? Until the cleanup happens and shard recovery completes, we won't be performing any writes. We are also avoiding subsequent merges after cleanup by modifying the merge policy to not allow merges when enough space is not present. Is it because the number of segments can grow in this approach, since merges will stop happening after a certain point?

@reta
Collaborator

reta commented Jun 16, 2023

Is it because the number of segments can grow in this approach, since merges will stop happening after a certain point?

That is correct (merges are vital for Apache Lucene)

@Bukhtawar
Collaborator

Can we put a write block on merge failure until we clean up unreferenced files, to avoid piling up more merges? Now if we don't fail the shard, will it continue to go into loops of merge failures? If yes, at some point we need to re-allocate or, worst case, fail the shard. If we reclaim enough space we remove the block, else we let the DiskThresholdMonitor lazily remove it.
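For reference, a rough sketch of what applying such a write block could look like from server-side code, assuming the internal node client and the existing index.blocks.write index setting; where exactly this would be invoked (engine failure handling vs. DiskThresholdMonitor) is the open question above:

```java
import org.opensearch.client.Client;
import org.opensearch.common.settings.Settings;

final class MergeFailureWriteBlock {
    // Applies the index-level write block after a disk-full merge failure.
    static void blockWrites(Client client, String indexName) {
        client.admin().indices()
            .prepareUpdateSettings(indexName)
            .setSettings(Settings.builder().put("index.blocks.write", true).build())
            .get();
    }
}
```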
