[Design Proposal] Add Remote Storage Options for Improved Durability #2700
Comments
Started adding low-level design details for Remote Segment Store.
I'll offer some opinions on a few open questions :)
As long as it does not introduce new dependencies, I think it should probably not be a plugin. The concrete implementation for interfacing with a remote store should be a plugin (like the current repository plugins), but the core logic of this feature can probably live directly inside the core and need not be a separate plugin.
I think the key guarantee needs to be that all acknowledged writes are persisted to the remote store. The specific durability guarantee will differ depending on the remote store, but I think OpenSearch's guarantee should be that the document is successfully persisted to the remote store before the write is acknowledged as successful to the user.
Is this new "durability" feature a property of the cluster or of an index? Assuming it is a property of an index, then I would assume you can enable for a new index at index creation time. Given that, a user should be able to use the reindex mechanism to effectively enable remote durability for an existing index by reindexing it to a new index with durability enabled. There may be future optimizations to make this process more lightweight, but I'd strongly consider leaving that out of scope for the initial version. (If durability is intended to be enabled at a cluster level and not per-index, then my comment doesn't really make sense)
As stated above, I'd try to leave backfilling out of scope for the initial version if the reindex mechanism can work.
What exactly do you mean by "snapshot implementation"? I'd hope the existing repository interface can be used/extended to meet the needs of copying segment files to/from the remote store. Not sure about the translog as it will likely have a very different access pattern. |
Nice proposal, a few comments/questions below.
Searchable remote index is currently focused on supporting remote indices in snapshot format. As @sohami replied to one of the comments #2900 (comment), the plan is to also support remote indices that will be stored as part of this proposal. We need to work closely to understand the overlap and possible impact on the design.
As of now, we are thinking of providing sync/async ways of syncing data to the remote store. The sync mechanism will impact performance (for remote translog) and the async mechanism will impact durability guarantees. We need to make the trade-off clear, and the user needs to choose based on their requirements. Having said this, the exact impact will be clear once we carry out performance tests.
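For illustration only, a minimal sketch of the two modes (class and method names here are hypothetical placeholders, not OpenSearch APIs): the sync path acknowledges a write only after the remote store confirms it, while the async path acknowledges immediately and uploads in the background.

```java
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/**
 * Illustrative sketch only -- not OpenSearch code. Shows the trade-off described
 * above: sync upload keeps the durability guarantee but adds latency to every
 * write; async upload keeps writes fast, but an acknowledged write may not yet
 * be persisted remotely.
 */
public class TranslogUploader {

    /** Hypothetical remote translog client. */
    interface RemoteTranslogStore {
        void upload(byte[] operation) throws IOException; // blocks until persisted remotely
    }

    private final RemoteTranslogStore remoteStore;
    private final ExecutorService asyncUploader = Executors.newSingleThreadExecutor();

    public TranslogUploader(RemoteTranslogStore remoteStore) {
        this.remoteStore = remoteStore;
    }

    /** Sync mode: the indexing thread waits for the remote store before acking. */
    public void uploadSync(byte[] operation) throws IOException {
        remoteStore.upload(operation); // added latency, but ack implies remote durability
    }

    /** Async mode: ack immediately; durability depends on the upload catching up. */
    public void uploadAsync(byte[] operation) {
        asyncUploader.submit(() -> {
            try {
                remoteStore.upload(operation);
            } catch (IOException e) {
                // a real system would retry and surface the failure via stats
            }
        });
    }
}
```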
It does, but the durability guarantees of different remote stores will be different. We called it out in the feature proposal as one of the considerations. I will add the same to this doc as well.
In V1, we want to support only Segment Replication. Because the primary and replicas share the same set of segments, Segment Replication makes the initial implementation of the remote store easier (more details here). We will support document replication in subsequent releases.
It would be in addition to the LocalDirectory. The class diagram does not make that clear. I will update it to reflect both.
Good point! It definitely adds the overhead of implementing a corresponding Repository and RepositoryPlugin. But the amount of code required to implement these is minimal. Example: AzureRepository and AzureRepositoryPlugin.
Agreed. My approach was to provide an API in V1 and an automated approach in V2. But we can have automated restore as part of V1 as well.
Yes, the snapshot API can be changed to take a query parameter telling it to fetch segment and translog data from the remote store.
Adding scope for V1 (refer to the Meta Issue for the task break-down).
Scope
In V1, remote store will only be supported for newly created indices for which segment replication is also enabled.
Setup
Storage
For an index with settings segment_replication = true and remote_store = true:
Restore
Thanks @sachinpkale for the detailed plan for the V1 scope. Is there any reason why you set segment replication as a prerequisite for remote storage? My understanding is that you should be able to copy segments to remote storage regardless of the replication method.
Yes @anasalkouz, you are right. The segments will be copied to remote storage irrespective of the replication method.
Thanks @sachinpkale for the proposal; overall it looks great.
Hi @sachinpkale. There are a few other requirements that are worth calling out for this to work with segment replication. While we don't have the first wave of segrep merged into main yet from our feature branch, we have a general idea of the files that need to be copied to/from a store. Primaries will need to include a byte[] of their current SegmentInfos object in addition to the segments. The SegmentInfos is required so that we can copy out after a refresh and not wait for the next commit point. I don't think anything in the directory API exists that would push this; it would be an addition. Second, we'll need a way to fetch this SegmentInfos on replicas in addition to the segment files. The general flow I'm thinking of is similar to what we have in the feature branch & in the proposal, with an interface (SegmentReplicationSource) extracted so we can implement it for both node->node replication and a remote store using the same flow.
I'm thinking this SegmentReplicationSource interface would be something like:
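Purely as an illustrative sketch (the method names, parameters, and return types below are placeholder assumptions, not a final API):

```java
import java.io.IOException;
import java.util.List;

/**
 * Illustrative sketch only. The idea: one abstraction that replicas use to fetch
 * the primary's latest SegmentInfos and the segment files it references,
 * implemented once for node->node replication and once for a remote store.
 */
public interface SegmentReplicationSource {

    /** Fetch the serialized SegmentInfos describing the latest refresh point on the primary. */
    byte[] getLatestSegmentInfos(String indexName, int shardId) throws IOException;

    /** Fetch the named segment files referenced by that SegmentInfos. */
    void copySegmentFiles(String indexName, int shardId, List<String> fileNames) throws IOException;
}
```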
Thoughts?
An alternative to pushing the SegmentInfos bytes in my previous comment is that with remote store enabled we change the behavior of a refresh to instead perform a Lucene commit. This way we continuously push an updated commit point / segments_N file to the remote store. We can disable the fsync on the local directory implementation to make commits less expensive because segments will be durably stored remotely. This would remove the need for a change to the directory implementation to push the SegmentInfos bytes. I'm not sure of the perf trade-off here. Either way I think we can include the SegmentInfos object itself in
@mch2 Thanks for the heads up. I am inclined towards the first approach. In remote store, we can have another file (with suitable prefix/suffix to
A Lucene commit is expensive, and OpenSearch internally performs various operations (like purging the translog) when flush is called. As a commit makes things durable on the local store, changing that behaviour would require understanding the complete commit flow (which I don't currently).
@mch2, based on the alternative proposal, wouldn't it create bi-modal code paths that couple themselves to the storage layer? I wouldn't recommend this. A Lucene commit, as @sachinpkale pointed out, is certainly expensive and might need modifications in the OpenSearch flush flow as well.
SegmentInfos is updated as new segments are created; on a refresh those new segments are flushed to disk, but the segments_N file is not updated. segments_N is only updated and flushed to disk on a commit.
@Bukhtawar Sorry, I meant conceptually performing more frequent commits instead of the existing refresh behavior, not splitting the code path of a refresh. I'd be curious how much more expensive commits are with fsyncs turned off. This would mean we could always write and re-read SegmentInfos from a stable segments_N file. If we don't have a function on the Directory to push the byte[] as a stream, writing the tmp file is not a problem. This would use logic similar to what SegmentInfos already uses to write for commits (SegmentInfos.write); we are already using the public write function to write to a byte[]. I wrote a little unit test to see that it works; the name of the file would need to be different so it does not conflict with the latest commit point.

```java
public void testWriteSegmentInfos() throws Exception {
    List<Engine.Operation> operations = generateHistoryOnReplica(
        between(1, 500),
        randomBoolean(),
        randomBoolean(),
        randomBoolean()
    );
    for (Engine.Operation op : operations) {
        applyOperation(engine, op);
    }
    // Write the latest in-memory SegmentInfos to a generation-based file name
    // that does not collide with the real segments_N commit file.
    final SegmentInfos latestSegmentInfos = engine.getLatestSegmentInfos();
    long nextGeneration = latestSegmentInfos.getGeneration();
    final String FILE_PREFIX = "latest_infos";
    String segmentFileName = IndexFileNames.fileNameFromGeneration(FILE_PREFIX, "", nextGeneration);
    try (IndexOutput output = store.directory().createOutput(segmentFileName, IOContext.DEFAULT)) {
        latestSegmentInfos.write(output);
    }
    // Read it back and verify it references the same set of files.
    try (final ChecksumIndexInput checksumIndexInput = store.directory().openChecksumInput(segmentFileName, IOContext.DEFAULT)) {
        final SegmentInfos segmentInfos = SegmentInfos.readCommit(store.directory(), checksumIndexInput, latestSegmentInfos.getGeneration());
        assertEquals(segmentInfos.files(true), latestSegmentInfos.files(true));
    }
}
```
@mch2 If we take the remote store out of the picture, where are we planning to store the SegmentInfos byte[]? Will it be in the memory of the primary? Can we create a new file that is appended with new data (or, if we want this file to be immutable, we can replace the file) each time a refresh happens? This way, nothing changes once the remote store comes into the picture.
@sachinpkale For node-node replication the byte[] isn't written to disk; we pull the primary's latest SegmentInfos in memory, ensure all segments it references are incRef'd for the duration of the copy event, and then serialize it over the wire.
We could, this is what the snippet above does. I would prefer we write the object directly to the remote store and avoid the extra flush to the local disk if we can. I think this is possible with what you've implemented in #3460, although there's an extra step of converting the IndexOutput back to an Input. Something like...

```java
ByteBuffersDataOutput buffer = new ByteBuffersDataOutput();
try (ByteBuffersIndexOutput tmpIndexOutput = new ByteBuffersIndexOutput(buffer, "temporary", "temporary")) {
    segmentInfos.write(tmpIndexOutput);
}
final BufferedChecksumIndexInput indexInput = new BufferedChecksumIndexInput(
    new ByteBuffersIndexInput(buffer.toDataInput(), "SegmentInfos")
);
copyBytes(indexInput...)
```
@mch2 Please take a look at this commit: sachinpkale@b195a2a#diff-45c1ec7b5c456e144972500c9883018f4e74f4e7d054c448bf44ef308dbbb8e6R116-R123. In this commit, we are uploading intermediate This will help the remote store to keep track of uploaded files as well. It enables us to delete merged-away segments in the async flow.
@sachinpkale I'd like to help. Are there some small tasks that I can get started with? Thanks
Goal
This doc proposes a high-level and low-level design to provide durability guarantees for the indexed data using a remote store. For more information on the overview and challenges, please refer to the Feature Proposal.
Requirements
Functional
Non-Functional
Out of Scope
This design doc focuses on the durability aspect of using the remote store. Even though the design will be extensible enough to integrate the remote store with other constructs, those details will not be discussed here.
Approach
Hot data storage in OpenSearch is divided into two parts: the translog and Lucene segments.
A successful indexing operation writes data to the translog. A periodic job consumes the translog and creates segment files. The translog is purged once a Lucene commit is triggered. In order to provide durability guarantees for the indexed data, we need to provide remote store support for the translog as well as the segments. In the following sections, we use remote translog and remote segment store to store the translog and segments respectively.
Invariant
At any point in time, remote translog and remote segment store together contain all the indexed data.
Storage
Restore
High Level Design
Architecture Diagram
Remote Translog
Data Model
ToDo
Requirements
ToDo
Remote Segment Store
Data Model
Segment files will be stored in the remote store following an index_UUID/shard_id hierarchy (the exact hierarchy may have more elements).
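For illustration only (the exact layout may differ and may include more path elements, such as a repository base path or per-refresh metadata), segment files for a single shard could be laid out as:

```
<remote-segment-store-root>/<index_uuid>/<shard_id>/_0.si
<remote-segment-store-root>/<index_uuid>/<shard_id>/_0.cfe
<remote-segment-store-root>/<index_uuid>/<shard_id>/_0.cfs
```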
Requirements
API Design
ToDo: Add verb, payload, success and error response against each API
Low Level Design
Remote Translog
ToDo
Remote Segment Store
Class Diagram
The interface for remote segment storage (RemoteDirectory) will be extended from Lucene's Directory. Using the APIs provided by Directory will ensure that the remote segment store is able to provide the same functionality that the local disk does. Using the same APIs for the local disk and the remote segment store will also make it extensible, where the local disk can be replaced by the remote segment store (this is not part of durability, but it should avoid re-architecting the APIs/system later).
Note: RemoteDirectory may not support all the APIs of Directory from the first release. The plan is to open up more and more Directory APIs based on the milestone targeted for each upcoming release.
Today, with Snapshots, OpenSearch already supports writing segment files to a remote blob store. It also supports reading these segment files while restoring data from segments. The BlobContainer interface is defined to read and write segments (blobs) from/to a blob store (HDFS, Azure Blob Storage, S3).
RemoteDirectory will hold a reference to a BlobContainer and will delegate the Directory API implementation to BlobContainer with the required pre- and post-processing. This avoids duplicating the code to read/write segment files from/to the remote store.
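The following is a minimal sketch of this delegation, assuming the BlobContainer listing/read/delete methods exposed to repository plugins; the method coverage, naming, and error handling here are illustrative assumptions rather than the final implementation:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.util.Collection;
import java.util.List;
import java.util.Set;

import org.apache.lucene.store.ByteBuffersDataInput;
import org.apache.lucene.store.ByteBuffersIndexInput;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.IndexOutput;
import org.apache.lucene.store.Lock;
import org.opensearch.common.blobstore.BlobContainer;

/**
 * Illustrative sketch: a Directory implementation that delegates to a BlobContainer.
 * Only a subset of Directory APIs is shown, mirroring the note above that
 * RemoteDirectory may not support all of them initially.
 */
public class RemoteDirectory extends Directory {

    private final BlobContainer blobContainer;

    public RemoteDirectory(BlobContainer blobContainer) {
        this.blobContainer = blobContainer;
    }

    @Override
    public String[] listAll() throws IOException {
        // Each blob in the container corresponds to one segment file.
        return blobContainer.listBlobs().keySet().toArray(new String[0]);
    }

    @Override
    public void deleteFile(String name) throws IOException {
        blobContainer.deleteBlobsIgnoringIfNotExists(List.of(name));
    }

    @Override
    public long fileLength(String name) throws IOException {
        return blobContainer.listBlobsByPrefix(name).get(name).length();
    }

    @Override
    public IndexInput openInput(String name, IOContext context) throws IOException {
        // Simplification: read the whole blob into memory. A real implementation
        // would stream the blob and verify checksums.
        try (InputStream in = blobContainer.readBlob(name)) {
            byte[] bytes = in.readAllBytes();
            return new ByteBuffersIndexInput(
                new ByteBuffersDataInput(List.of(ByteBuffer.wrap(bytes))), name);
        }
    }

    // The write path (createOutput delegating to BlobContainer#writeBlob) and the
    // remaining Directory APIs are omitted from this sketch.
    @Override
    public IndexOutput createOutput(String name, IOContext context) {
        throw new UnsupportedOperationException();
    }

    @Override
    public IndexOutput createTempOutput(String prefix, String suffix, IOContext context) {
        throw new UnsupportedOperationException();
    }

    @Override
    public void sync(Collection<String> names) {
        // Durability is delegated to the remote store; nothing to fsync locally.
    }

    @Override
    public void syncMetaData() {}

    @Override
    public void rename(String source, String dest) {
        throw new UnsupportedOperationException();
    }

    @Override
    public Lock obtainLock(String name) {
        throw new UnsupportedOperationException();
    }

    @Override
    public Set<String> getPendingDeletions() {
        return Set.of();
    }

    @Override
    public void close() {}
}
```

Throwing UnsupportedOperationException for the remaining Directory APIs mirrors the note above that RemoteDirectory may not support all of them in the first release.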
Sequence Diagram
Segment Upload
Segment Restore
Scope
In V1, remote store will only be supported for newly created indices for which segment replication is also enabled.
Setup
Supported repository plugins: repository-hdfs, repository-azure, repository-s3, repository-gcs.
Index will be created with segment_replication and remote_store enabled.
The remote_store setting is non-dynamic. Once the index is created, this setting can not be changed.
Storage
For an index with settings segment_replication = true and remote_store = true:
Restore
Lifecycle of data in remote store
V1
Store Data
Data will be stored in the remote store only for indices with the remote_store=true setting. As mentioned in the scope, remote store V1 is only supported for new indices created with the remote_store=true setting.
Restore Data
Integration with existing constructs/features
Failover
ToDo: Evaluating different options here
Peer Recovery
ToDo
Replication Strategy
Document Based Replication
ToDo
Segment Based Replication
ToDo
Cross Cluster Replication
ToDo
Snapshot
ToDo
Point-In-Time Recovery
ToDo
Metrics/Stats
ToDo
Open Questions