
[RFC] Optimized Prefix Pattern for Shard-Level Files for Efficient Snapshots #15146

Closed
ashking94 opened this issue Aug 7, 2024 · 4 comments · Fixed by #15426
Labels
discuss · enhancement · RFC · Storage:Snapshots

Comments

@ashking94
Member

ashking94 commented Aug 7, 2024

TL;DR - This RFC, inspired by #12567, proposes an optimized prefix pattern for shard-level files in a snapshot.

Problem statement

Snapshots are backups of a cluster's indexes and state, including cluster settings, node information, index metadata, and shard allocation information. They are used to recover from failures, such as a red cluster, or to move data between clusters without loss. Snapshots are stored in a repository in a hierarchical manner that represents the composition of shards, indexes, and the cluster. However, this structure poses a scaling challenge when there are numerous shards due to limitations on concurrent operations over a fixed prefix in a remote store. In this RFC, we discuss various aspects to achieve a solution that scales well with a high number of shards.

Current repository structure

Below is the current structure of the snapshot -
[Diagram of the current snapshot repository structure; image from https://opensearch.org/blog/snapshot-operations/.]

The files created once per snapshot, or once per index per snapshot, are somewhat immune to throttling because there are relatively few of them, and they are uploaded by the active cluster manager using only five snapshot threads. These files include:

  1. index-N (RepositoryData)
  2. index.latest (points to the latest generation of RepositoryData)
  3. incompatible-snapshots (list of snapshot IDs that are no longer compatible with the current OpenSearch version)
  4. snap-uuid.dat (SnapshotInfo for the snapshot)
  5. meta-uuid.dat (Metadata for the snapshot)
  6. indices/Xy1234-z_x/meta-uuid.dat (IndexMetadata for the index)

Files susceptible to throttling are created on data nodes, generally per primary shard per snapshot:

  1. indices/Xy1234-z_x/0/__VP05oDMVT & more (Files mapped to real segment files)
  2. indices/Xy1234-z_x/0/snap-uuid.dat (BlobStoreIndexShardSnapshot)
  3. indices/Xy1234-z_x/0/index-uuid (BlobStoreIndexShardSnapshots)
  4. Similar folders for different shards

Here, Xy1234-z_x is the UUID of the index within the snapshot repository; similar folders exist for every other index.

Issue with current structure

The existing structure leads to throttling in clusters with a high shard count, resulting in longer snapshot creation and deletion times due to retries. In worst-case scenarios, this can lead to partial or failed snapshots.

Requirements

  1. Ensure smooth snapshot operations (create, delete) without throttling for high shard counts.
  2. Implement a scalable prefix strategy to handle increasing concurrency on the remote store as shard count grows.

Proposed solution

Introduce a prefix pattern, supported by multiple repository providers (e.g., AWS S3, GCP Cloud Storage, Azure Blob Storage), that maximizes the spread of data across prefixes for better scaling. The providers' general recommendation is to spread data across as many prefixes as possible so that they can scale request throughput. I propose to apply this prefix strategy to shard-level files; #12567 already introduced it for remote store shard-level data and metadata files. A minimal sketch of the idea follows the list below.

Key changes in the proposed structure:

  • The shard-level path now includes a hash prefix: hash(index-name-or-uuid, shard-id)/<base-path>/indices/<shard-id>/<index-uuid>/
  • With this prefix pattern, each shard gets a unique prefix prepended at the start of its path.
  • If a base path is present, it appears between the hash prefix and "indices".
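
To make the idea concrete, here is a minimal Java sketch of how a deterministic hash prefix could be derived and prepended to a shard-level path. The hash function, its width, and the helper names are assumptions chosen for illustration; the actual resolver in OpenSearch (introduced for remote store paths in #12567) may differ.

```java
import java.nio.charset.StandardCharsets;

/**
 * Illustrative sketch (not the actual OpenSearch implementation) of deriving a
 * deterministic hash prefix for a shard-level snapshot path.
 */
public final class ShardPrefixSketch {

    // FNV-1a 64-bit hash over "indexUUID/shardId"; any stable hash would do here.
    static long fnv1a64(String input) {
        long hash = 0xcbf29ce484222325L;
        for (byte b : input.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xffL);
            hash *= 0x100000001b3L;
        }
        return hash;
    }

    /** Builds <hash>/<base-path>/indices/<shard-id>/<index-uuid>/ as described above. */
    static String shardPath(String basePath, String indexUUID, int shardId) {
        long h = fnv1a64(indexUUID + "/" + shardId);
        // Render the low bits as a binary string, mirroring the binary-looking
        // prefixes in the example tree (an illustrative choice, not the real encoding).
        String prefix = Long.toBinaryString(h & ((1L << 15) - 1));
        StringBuilder path = new StringBuilder(prefix).append('/');
        if (basePath != null && !basePath.isEmpty()) {
            path.append(basePath).append('/');
        }
        return path.append("indices/").append(shardId).append('/').append(indexUUID).append('/').toString();
    }

    public static void main(String[] args) {
        // e.g. 101110011010111/indices/1/U-DBzYuhToOfmgahZonF0w/
        System.out.println(shardPath("", "U-DBzYuhToOfmgahZonF0w", 1));
    }
}
```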

High level approach

Store the path type in customData within IndexMetadata, which is already persisted during snapshot creation, so that later reads can resolve the correct path layout per index.
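As an illustration of this approach, the sketch below stands in for IndexMetadata custom data with a plain map; the "snapshot" key and "path_type" field are hypothetical names, not actual OpenSearch constants. The point is that the path type recorded at snapshot-creation time lets later operations (restore, delete) resolve the right prefix layout, defaulting to the legacy fixed path for snapshots taken before this change.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative sketch only: a plain map stands in for IndexMetadata custom data
 * to show how the path type could be recorded and later resolved.
 */
public final class PathTypeCustomDataSketch {

    enum PathType { FIXED, HASHED_PREFIX }

    // Written while creating the snapshot, alongside the index metadata.
    static Map<String, String> buildSnapshotCustomData(PathType type) {
        Map<String, String> custom = new HashMap<>();
        custom.put("path_type", type.name());
        return custom;
    }

    // Read while restoring or deleting: default to FIXED so older snapshots
    // (written before this change) keep resolving against the legacy layout.
    static PathType resolvePathType(Map<String, Map<String, String>> customData) {
        Map<String, String> snapshotCustom = customData.get("snapshot");
        if (snapshotCustom == null || !snapshotCustom.containsKey("path_type")) {
            return PathType.FIXED;
        }
        return PathType.valueOf(snapshotCustom.get("path_type"));
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> customData = new HashMap<>();
        customData.put("snapshot", buildSnapshotCustomData(PathType.HASHED_PREFIX));
        System.out.println(resolvePathType(customData)); // HASHED_PREFIX
    }
}
```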

Cloud neutral solution

This approach is supported by multiple cloud providers:

  1. GCP Storage - https://cloud.google.com/storage/docs/request-rate#ramp-up
  2. AWS S3 - https://repost.aws/knowledge-center/http-5xx-errors-s3
  3. Azure blob storage - https://learn.microsoft.com/en-us/azure/storage/blobs/storage-performance-checklist

Proposed repository structure

<ROOT>
├── 210101010101101
│   └── indices
│       └── 1
│           └── U-DBzYuhToOfmgahZonF0w
│               ├── __C-Fa1S-NSbqEWJU3h1tdVQ
│               ├── __EeKqbgKCRHuHn-EF5GzlJA
│               ├── __JS_z4nUURmSsyFVsmgmBcA
│               ├── __a2cczAV8Sg2hcqs1uQBryA
│               ├── __d0CF9LqhTN6REWmuLzO96g
│               ├── __gGvHzcveQwiXknAWxor2IQ
│               ├── __jlv0IqxNRGqbmXZNQ3cr4A
│               ├── __oEfbUhM3QvOh7EDg1lzlnQ
│               ├── index-Y99a4XQjRwmVM5s5uIrFRw
│               ├── snap-L7HfWiUYSj6as4jSi3yKeg.dat
│               └── snap-QajL9GX7Tj2Itdbm5GzAqQ.dat
├── 910010111101110
│   └── indices
│       └── 0
│           └── LsKqUfn2ROmjAkW6Rec1lQ
│               ├── __8FggEzgeRXaXUQY3U6Xlmg
│               ├── __F1iiaK-uTcmUD608gQ54Ng
│               ├── __JW0cFZiVTzaAOdzK03bsEw
│               ├── __N19wYbk0TVOqkU8NJw6ptA
│               ├── __dE6opnQ6QLmChNwQ4i3L6w
│               ├── __hylfOCALSPiLwRNfR27wbA
│               ├── __rNFtqVvuRnOomXxeF0hLfg
│               ├── __t8fnoBiATiqR8N9T2Qiejw
│               ├── index-cEnUDNHXTveF6OR779cKSA
│               ├── snap-L7HfWiUYSj6as4jSi3yKeg.dat
│               └── snap-QajL9GX7Tj2Itdbm5GzAqQ.dat
├── M11010111001101
│   └── indices
│       └── 0
│           └── U-DBzYuhToOfmgahZonF0w
│               ├── __EKjl8lZlTCuUADJKlCM1GQ
│               ├── __Gd_-r-qlTve8B-VwxKicWA
│               ├── __f9qc9DWARWu6cw41q0pbOw
│               ├── __g_dEX3n3TISxMmO_SDmjYg
│               ├── __jjBTwJCrQKuT6gPxfNRbag
│               ├── __neVNkcaOSjuNAW_J9XiPag
│               ├── __ubsXbHN5QMi6dX0ragf-TQ
│               ├── __yzFDy62NTuildlyJc0DknA
│               ├── index-5RTtDql2Q9CBrBVcKgfI5Q
│               ├── snap-L7HfWiUYSj6as4jSi3yKeg.dat
│               └── snap-QajL9GX7Tj2Itdbm5GzAqQ.dat
├── index-3
├── index.latest
├── indices
│   ├── LsKqUfn2ROmjAkW6Rec1lQ
│   │   └── meta-eNNJLJEBNa6hCAn8tcSn.dat
│   └── U-DBzYuhToOfmgahZonF0w
│       └── meta-d9NJLJEBNa6hCAn8tcSn.dat
├── meta-L7HfWiUYSj6as4jSi3yKeg.dat
├── meta-QajL9GX7Tj2Itdbm5GzAqQ.dat
├── snap-L7HfWiUYSj6as4jSi3yKeg.dat
├── snap-QajL9GX7Tj2Itdbm5GzAqQ.dat
└── w10110111010101
    └── indices
        └── 1
            └── LsKqUfn2ROmjAkW6Rec1lQ
                ├── __ABzT7uaYQq2_WpF22ewuJg
                ├── __PTGQRYT_QyeAZ7llHCHP_Q
                ├── __ROWGAvOQRWarW9K6c_8BuQ
                ├── __RYbp_hw2Qb2T9rEECYZM5w
                ├── __YFbMl9ZNRMqm9i7ItDkz4Q
                ├── __bGaQqDsPR--08ifzAOA23A
                ├── __dTPVxsjCTfqeUg37B8_zUg
                ├── __z1N6Uz2MR26GoNjLuU_LQA
                ├── index-Mv_2sClsS1Wf0ynx8c697g
                ├── snap-L7HfWiUYSj6as4jSi3yKeg.dat
                └── snap-QajL9GX7Tj2Itdbm5GzAqQ.dat
  • In the example above, the shard-level path is now composed of: 1st the hash (210101010101101), 2nd "indices", 3rd the shard ID, and 4th the index UUID (generated per index name by the snapshot), followed by all the shard's files.
  • In this example the base path is empty. If a base path were configured, it would appear between the hash and "indices".

Appendix

Sample current repository structure

<BASE_PATH>
├── index-1
├── index.latest
├── indices
│   ├── OQb-77UPRLmek3IzfRR3Fg
│   │   ├── 0
│   │   │   ├── __3lX2qNtkShSMD2zhbW2hww
│   │   │   ├── __5ILryCGbSd-ojWvIxAK7Mw
│   │   │   ├── __UHpuUxuoTyqrc5kzCmSAHA
│   │   │   ├── __c7Tuu2zRSROfiy7UCq7KOQ
│   │   │   ├── __d1jt099nQSycUnO4mx9XXQ
│   │   │   ├── __oUVAcviJRdu0kWqYQyLrnA
│   │   │   ├── index-GoobOcTzQmqG5LI67Y7dvg
│   │   │   ├── snap-4JlPnU7nT5O1dzW7SP318Q.dat
│   │   │   └── snap-I0esTsK9Qiy114LI1xFD4Q.dat
│   │   ├── 1
│   │   │   ├── __DRjMuHHKRa-FluuZl94qLQ
│   │   │   ├── __EDoFw1gqQQastQjG8bat1Q
│   │   │   ├── __GypZv7z0SUu6LNiML4sefQ
│   │   │   ├── __ZP1jsyQnTQOMIO9Vx2n5Eg
│   │   │   ├── __nrm3d_o9Skqw3tAyyPnr7A
│   │   │   ├── __qI-jax0aTaycxgZkzsKmRw
│   │   │   ├── index-WsEwZHu6QZ-_Ub5h8p0AGA
│   │   │   ├── snap-4JlPnU7nT5O1dzW7SP318Q.dat
│   │   │   └── snap-I0esTsK9Qiy114LI1xFD4Q.dat
│   │   └── meta-UV9P7pABMtEAgBnTfhV6.dat
│   └── gYpI7EgrQlqGqiKBWnp1bw
│       ├── .DS_Store
│       ├── 0
│       │   ├── __ATpweBS4RV2e7-iWmOItfg
│       │   ├── __Kv_a0dSNQfyzSlkiONETnw
│       │   ├── __Q3EuuawlQQq2Cm2rHZpYRw
│       │   ├── __kHrmxfa5RAuHhjEEeP8czA
│       │   ├── index--0B07yZUQr63TPhdStkDzQ
│       │   ├── snap-4JlPnU7nT5O1dzW7SP318Q.dat
│       │   └── snap-I0esTsK9Qiy114LI1xFD4Q.dat
│       ├── 1
│       │   ├── __N2Imh37BSQGwFVmg5GQAsg
│       │   ├── __OVbkh2p-QxGjFfMM-XuQ-g
│       │   ├── __r2989-oaR_S--xOC-1Mlgw
│       │   ├── __r5yJ8QY-TdOK9Uy5cmozHw
│       │   ├── index-oqu0C2_PTXamVfS7BCpH1A
│       │   ├── snap-4JlPnU7nT5O1dzW7SP318Q.dat
│       │   └── snap-I0esTsK9Qiy114LI1xFD4Q.dat
│       └── meta-UF9P7pABMtEAgBnTfhV6.dat
├── meta-4JlPnU7nT5O1dzW7SP318Q.dat
├── meta-I0esTsK9Qiy114LI1xFD4Q.dat
├── snap-4JlPnU7nT5O1dzW7SP318Q.dat
└── snap-I0esTsK9Qiy114LI1xFD4Q.dat
@sachinpkale
Member

Thanks @ashking94 for the proposal.

In this example the base path is empty. If a base path were configured, it would appear between the hash and "indices".

Currently, the entire snapshot contents (data + metadata) are stored within the provided base path. With the new approach, this is no longer true (correct me if I am wrong).

This may change the way users organise snapshots. For example, I might run 5 clusters that each use the repository-s3 plugin for snapshots against the bucket s3://snapshot_bucket with base paths cluster1 through cluster5, and clean up the corresponding base path once a particular cluster is no longer needed and is deleted.

To keep the existing behaviour consistent, does it make sense to introduce the proposed changes as a new type of snapshot?

@ashking94
Member Author

Currently, the entire snapshot contents (data + metadata) are stored within the provided base path. With the new approach, this is no longer true (correct me if I am wrong).

That's right.

This may change the way users organise snapshots. For example, I might run 5 clusters that each use the repository-s3 plugin for snapshots against the bucket s3://snapshot_bucket with base paths cluster1 through cluster5, and clean up the corresponding base path once a particular cluster is no longer needed and is deleted.

We will have a fixed path where we will upload a record of all the different paths under which a cluster's data is stored. Also, this will be disabled by default on a cluster so as not to break backward compatibility.

To keep the existing behaviour consistent, does it make sense to introduce the proposed changes as a new type of snapshot?

I am still thinking through how to support the new mode alongside data that has already been uploaded under the fixed path. Apart from that, we would also need the capability to enable the new mode only for repository r1 but not repository r2. Let me cover these in the PRs.

@harishbhakuni
Contributor

harishbhakuni commented Aug 22, 2024

Thanks @ashking94 for the proposal. Along similar lines to what Sachin mentioned: today an admin-level user can give path-level access to different users in a bucket, and those users can provide those paths as base paths and use them across multiple clusters. It looks like this feature would not work for those use cases; to use it, a user must have root-level access on the bucket.

@ashking94
Member Author

Thanks @ashking94 for the proposal. Along similar lines to what Sachin mentioned: today an admin-level user can give path-level access to different users in a bucket, and those users can provide those paths as base paths and use them across multiple clusters. It looks like this feature would not work for those use cases; to use it, a user must have root-level access on the bucket.

Thanks for your comment, @harishbhakuni. The access would need to be provided to the cluster at the bucket level. Having more clusters would allow the autoscaling to work even better. We definitely need to build some mechanism to segregate access to the domain-level paths based on the base-path substring in the key path.
