[Filestore] Multitablet filesystems #1350
…a minor blockstore test fix (#1483)
* issue-1350: multitablet filesystems api, follower configuration code, automatic session creation in followers, node creation in followers via requests to leader, service uts (#1420)
* issue-1350: multitablet filesystems api, follower configuration code, automatic session creation in followers, simple ut
* issue-1350: fixed ut - we need to always launch node-local tablet/ss/hive proxies
* issue-1350: making NodeIds and HandleIds different in different followers
* issue-1350: node creation in followers via a CreateNode or CreateHandle request to leader
* issue-1350: node creation by CreateHandle via the leader ut + fixes
* issue-1350: GetNodeAttr and ListNodes implementation in the leader
* issue-1350: address some review comments
* stabilizing endpoints_grpc/ut::ShouldHandleClientDisconnection (#1480)
* issue-1350: support for multitablet filesystems in StorageServiceActor (#1446)
* issue-1350: implemented AccessNode,SetNodeAttr,AllocateData,DestroyHandle,XAttr requests forwarding to followers
* issue-1350: implemented WriteData,ReadData-related requests forwarding to followers
* issue-1350: implemented two stage CreateHandle request handling - CreateHandle requests with ParentNodeId + Name should be first sent to leader, then modified and sent to the appropriate follower
* issue-1350: implemented CreateNode request forwarding to follower
* issue-1350: implemented GetNodeAttr request forwarding to follower
* issue-1350: fix
* issue-1350: and finally - ListNodes support for multitablet filesystems - first ListNodesRequest is sent to leader, then GetNodeAttrRequests are sent to the followers
* issue-1350: fixed MultiTabletForwardingEnabled for GetNodeAttr and ListNodes
* issue-1350: StorageServiceActor multitablet forwarding - addressed review comments
* issue-1350: added more multitablet uts, added multitablet fio_index test suite, made some fixes (#1482)
* issue-1350: added more multitablet uts, added multitablet fio_index test suite, made some fixes
* issue-1350: E_EXCLUSIVE flag should be unset in CreateHandle requests which are sent to followers
* fixed cmakelists and includes
Same configuration:
mpi run+ior results:
If a file is unlinked during listing, we may get an ENOENT error from GetNodeAttr in ListNodesActor and return E_IO to the client:
https://github.com/ydb-platform/nbs/blob/main/cloud/filestore/libs/storage/service/service_actor_listnodes.cpp#L337 - this needs to be done ASAP, but in a slightly different way: upon getting this error we do need to remove the node from the response, but we also need to stat the node in the leader and check whether it was actually deleted (it doesn't exist anymore, or the name exists but points to a different followerId+followerName).
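The check described in this comment could be sketched roughly as follows. This is a minimal Python model, not the actual C++ ListNodesActor code: `NodeRef`, `filter_listing`, and the two callbacks are hypothetical names, and returning still-referenced nodes in a separate `inconsistent` list is an assumption, since the comment does not say what to do when the leader still points at the missing node.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class NodeRef:
    """Hypothetical model of a leader-side node ref."""
    name: str
    follower_id: str
    follower_name: str


def filter_listing(
    listed: List[NodeRef],
    get_node_attr: Callable[[NodeRef], Optional[dict]],
    stat_in_leader: Callable[[str], Optional[NodeRef]],
) -> Tuple[List[Tuple[NodeRef, dict]], List[NodeRef]]:
    """Drop nodes whose follower returned ENOENT (modeled as None).

    Returns (kept, inconsistent). A node is 'inconsistent' when the
    follower has no such node but the leader still holds the same ref.
    """
    kept, inconsistent = [], []
    for ref in listed:
        attr = get_node_attr(ref)
        if attr is not None:
            kept.append((ref, attr))
            continue
        # ENOENT from the follower: the node leaves the response either way.
        current = stat_in_leader(ref.name)
        same_target = current is not None and (
            current.follower_id,
            current.follower_name,
        ) == (ref.follower_id, ref.follower_name)
        if same_target:
            # Assumption: report it instead of silently ignoring it.
            inconsistent.append(ref)
    return kept, inconsistent
```

With this shape, a concurrent unlink or rename only shortens the listing, while a ref that the leader still holds is surfaced separately instead of turning the whole response into E_IO.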
I will implement automatic creation of the shards based on FS size and will close this issue after that. Everything else is done.
Decided to move this to a separate issue: #1932
Right now one FS == one IndexTablet which is a bottleneck for:
It's also a limiting factor for max FS size, because a single tablet needs to be able to store the whole block layer index, and this index can't be too big. In fact, handling a 100+ TiB FS is already a challenge.
We need to be able to provide linear scalability for FS size and single file-level ops:
The suggested solution is to create N + 1 tablets for a single FS, where N is determined from the FS size at FS creation time. The first versions may even require manual creation of the additional N tablets. One tablet would store the directory inodes and node refs. The refs that point to file inodes would point into the other N tablets, each of which would store its files directly under its root directory; the names of the node refs in those root directories may simply be GUIDs.
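A toy in-memory model of this layout may make the split easier to picture. Everything here is illustrative: the class names, the dict-based storage, and the round-robin choice of tablet are assumptions and are not taken from the filestore code.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class FileTablet:
    """One of the N tablets: files live directly under its root, keyed by GUID."""
    files: Dict[str, bytearray] = field(default_factory=dict)


@dataclass
class DirectoryTablet:
    """The '+1' tablet: directory tree and node refs only, no file data."""
    # (parent directory path, name) -> (file tablet index, GUID name)
    refs: Dict[Tuple[str, str], Tuple[int, str]] = field(default_factory=dict)


class MultiTabletFs:
    def __init__(self, n: int) -> None:
        self.directory = DirectoryTablet()
        self.file_tablets: List[FileTablet] = [FileTablet() for _ in range(n)]
        self._next = 0

    def create_file(self, parent: str, name: str) -> Tuple[int, str]:
        # Assumption: round-robin placement; the issue does not fix a policy.
        idx = self._next % len(self.file_tablets)
        self._next += 1
        guid = str(uuid.uuid4())
        self.file_tablets[idx].files[guid] = bytearray()
        self.directory.refs[(parent, name)] = (idx, guid)
        return idx, guid

    def write(self, parent: str, name: str, data: bytes) -> None:
        idx, guid = self.directory.refs[(parent, name)]
        self.file_tablets[idx].files[guid] += data

    def read(self, parent: str, name: str) -> bytes:
        idx, guid = self.directory.refs[(parent, name)]
        return bytes(self.file_tablets[idx].files[guid])
```

In this model, name lookups always go through the directory tablet, while data reads and writes touch only the file tablet that the ref points to, which is where the per-file scalability would come from.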
Technically, file creation and deletion would require a multi-tablet transaction, but this won't be hard to implement: we don't need a full 2PC here, and we can also keep a cache of pre-created 0-size files so that creation requests can be served without a multi-tablet transaction. Deletion can be served asynchronously: once we delete a file's last node ref (and there are no open file handles), the client can no longer find the file, so there is no need for a real synchronous multi-tablet transaction here either. Again, the first version doesn't need these optimizations and can simply handle multi-tablet transactions in the following way:
What needs to be done in the first version:
We also need to properly track sessions in slave tablets. I think the easiest way to do that is for the master tablet to create the sessions in the slave tablets whenever a session is created in the master tablet. The master tablet is then responsible for pinging the slave tablet sessions.
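The session fan-out described above could look roughly like this. It is a simplified sketch: `Tablet`, `MasterTablet`, and the timestamp-based bookkeeping are hypothetical, and error handling for unreachable slave tablets is deliberately left out.

```python
from typing import Dict, List


class Tablet:
    """Hypothetical tablet that only tracks session liveness."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.sessions: Dict[str, float] = {}  # session id -> last ping time

    def create_session(self, session_id: str, now: float) -> None:
        self.sessions[session_id] = now

    def ping_session(self, session_id: str, now: float) -> None:
        if session_id not in self.sessions:
            raise KeyError(f"unknown session {session_id} in {self.name}")
        self.sessions[session_id] = now


class MasterTablet(Tablet):
    """Creates and pings sessions in the slave tablets on the client's behalf."""

    def __init__(self, name: str, slaves: List[Tablet]) -> None:
        super().__init__(name)
        self.slaves = slaves

    def create_session(self, session_id: str, now: float) -> None:
        super().create_session(session_id, now)
        for slave in self.slaves:
            slave.create_session(session_id, now)

    def ping_session(self, session_id: str, now: float) -> None:
        super().ping_session(session_id, now)
        for slave in self.slaves:
            slave.ping_session(session_id, now)
```

The client only ever talks to the master tablet about session lifetime; the slave tablets see the same session id, which keeps handle and lock ownership consistent across tablets under this assumption.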