[Filestore] Multitablet filesystems #1350

Closed
qkrorlqr opened this issue Jun 5, 2024 · 7 comments
qkrorlqr commented Jun 5, 2024

Right now one FS == one IndexTablet, which is a bottleneck for:

  • fetching and updating block layer index info upon read and write requests - severely limits our read/write IOPS
  • fetching and updating byte layer data upon unaligned read and write requests
  • processing file metadata requests like open/close/stat/create/delete

It's also a limiting factor for max FS size, because a single tablet needs to be able to store the whole block layer index, and this index can't grow arbitrarily large. In fact, handling a 100+TiB FS is a challenge already.

We need to be able to provide linear scalability for FS size and for single-file-level ops:

  • any kind of file IO
  • open/close/stat calls

The suggested solution is to make N + 1 tablets for a single FS, where N will be determined from the FS size upon FS creation. First versions may even require manual creation of the additional N tablets. One tablet would store the directory inodes and node refs. The refs which point to file inodes would point to the other N tablets, which would store all files directly under their roots; the names of those node refs in the file tablets' root directories may simply be guids (see the sketch below).
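
To make the layout concrete, here is a minimal sketch of what such a node ref could look like. All type and field names are illustrative, not the actual NBS types:

```cpp
// Hypothetical sketch: a node ref in the directory tablet either points to a
// local inode (e.g. a directory) or to a file inode living in one of the N
// file tablets, addressed by <tabletId, nodeName>, where nodeName is a guid
// in the file tablet's root directory.
#include <cstdint>
#include <string>
#include <variant>

struct TLocalRef {
    uint64_t NodeId = 0;            // inode id inside the same tablet
};

struct TExternalRef {
    uint64_t ShardTabletId = 0;     // one of the N file tablets
    std::string ShardNodeName;      // e.g. a guid in that tablet's root
};

struct TNodeRef {
    uint64_t ParentNodeId = 0;      // directory inode in the directory tablet
    std::string Name;               // user-visible file name
    std::variant<TLocalRef, TExternalRef> Target;
};
```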

Technically, file creation and deletion would become multi-tablet transactions, but that won't be hard to implement: we don't need a full 2PC here, and we can also keep a cache of pre-created 0-size files so that creation requests can be served without a multi-tablet transaction. Deletion can be served asynchronously: after we delete the last node ref (and there are no open file handles), the client can no longer find the file, so there is no need for a real synchronous multi-tablet transaction here either. Again, the first version doesn't need those optimizations and can simply handle multi-tablet transactions in the following way (a sketch follows the list):

  1. in addition to modifying the node refs table, we should also add an entry describing the requested op (create/delete node) to a log table
  2. as long as there are op entries in the log table, we should repeatedly retry those ops; as soon as an op completes, its entry should be deleted from that table
  3. the client (TStorageServiceActor) should return a proper error code if it sees a mismatch between a node ref and a node (e.g. a ref to a not-yet-created node should be treated as if the file doesn't exist - ENOENT)
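
A minimal sketch of that op-log loop, with invented types; the real implementation would persist the log in a tablet-local table and drive retries from an actor:

```cpp
// Hypothetical sketch of the op-log approach: the directory tablet persists
// the intended op in the same local transaction that modifies the node refs
// table, then a background loop retries the op against the file tablet until
// it is acknowledged. A crash before the ack simply causes another retry.
#include <cstdint>
#include <deque>
#include <string>

enum class EOp { CreateNode, UnlinkNode };

struct TOpLogEntry {
    uint64_t OpId = 0;
    EOp Op = EOp::CreateNode;
    uint64_t ShardTabletId = 0;     // target file tablet
    std::string ShardNodeName;      // guid of the node in that tablet's root
};

class TOpLog {
public:
    // Called in the same local transaction that modifies the node refs table.
    void Append(TOpLogEntry entry) { Pending.push_back(std::move(entry)); }

    // Background retry loop: an op stays in the log until the file tablet
    // acknowledges it.
    template <typename TSendToShard>
    void RetryPending(TSendToShard&& send) {
        for (auto it = Pending.begin(); it != Pending.end();) {
            if (send(*it)) {            // true == acked, op is durable there
                it = Pending.erase(it); // only now delete the log entry
            } else {
                ++it;                   // keep it, retry on the next pass
            }
        }
    }

private:
    std::deque<TOpLogEntry> Pending;
};
```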

What needs to be done in the first version:

  1. support external node references - node refs may point to either node ids local to the tablet or pairs <tabletId, nodeName> for external node references
  2. support external inodes in TStorageServiceActor - it should be ready to receive a node ref in the form of <tabletId, nodeName> in response to requests which work with file inodes, and then perform a second request to that tabletId; a handleId -> <tabletId, nodeName (or nodeId)> cache should probably be implemented in the first version (see the cache sketch after this list)
  3. support tablet relations configuration - there should be an ability to tell a tablet to create file inodes in a bunch of other tablets
  4. support the ability to create/delete external inodes
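
For item 2, a rough sketch of the handleId -> <tabletId, nodeName> cache, with hypothetical names (the real TStorageServiceActor logic is more involved):

```cpp
// Illustrative sketch: after the directory tablet answers with an external
// ref, the storage service caches it by handle id so subsequent I/O on that
// handle goes straight to the file tablet without re-asking the directory
// tablet. All names here are invented for illustration.
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>

struct TShardLocation {
    uint64_t ShardTabletId = 0;
    std::string ShardNodeName;      // or a shard-local node id once known
};

class THandleLocationCache {
public:
    void Put(uint64_t handleId, TShardLocation loc) {
        Cache[handleId] = std::move(loc);
    }

    std::optional<TShardLocation> Get(uint64_t handleId) const {
        if (auto it = Cache.find(handleId); it != Cache.end()) {
            return it->second;
        }
        return std::nullopt;        // miss: ask the directory tablet again
    }

    void Forget(uint64_t handleId) {
        Cache.erase(handleId);      // e.g. on DestroyHandle
    }

private:
    std::unordered_map<uint64_t, TShardLocation> Cache;
};
```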

We also need to properly track sessions in slave tablets. I think the easiest way to do that is to have the master tablet create the sessions in the slave tablets upon session creation in the master tablet. The master tablet is then responsible for pinging the slave tablet sessions.
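
A hedged sketch of that session fan-out, again with invented interfaces: the master mirrors session creation into the slave tablets and keeps the slave sessions alive with its own pings.

```cpp
// Hypothetical sketch, not the actual NBS session code: the master tablet
// fans session creation out to all slave tablets and periodically pings the
// mirrored sessions so they don't expire while the client session lives.
#include <string>
#include <vector>

struct ISlaveTabletClient {
    virtual ~ISlaveTabletClient() = default;
    virtual void CreateSession(const std::string& sessionId) = 0;
    virtual void PingSession(const std::string& sessionId) = 0;
};

class TMasterSessionManager {
public:
    explicit TMasterSessionManager(std::vector<ISlaveTabletClient*> slaves)
        : Slaves(std::move(slaves))
    {}

    // Mirror session creation into every slave tablet.
    void OnSessionCreated(const std::string& sessionId) {
        for (auto* slave : Slaves) {
            slave->CreateSession(sessionId);
        }
        Sessions.push_back(sessionId);
    }

    // The master owns keep-alive: one timer tick pings every mirrored session.
    void OnPingTimer() {
        for (const auto& sessionId : Sessions) {
            for (auto* slave : Slaves) {
                slave->PingSession(sessionId);
            }
        }
    }

private:
    std::vector<ISlaveTabletClient*> Slaves;
    std::vector<std::string> Sessions;
};
```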

@qkrorlqr qkrorlqr added filestore Add this label to run only cloud/filestore build and tests on PR 2024Q2 labels Jun 5, 2024
@qkrorlqr qkrorlqr self-assigned this Jun 5, 2024
qkrorlqr added a commit that referenced this issue Jun 23, 2024
…a minor blockstore test fix (#1483)

* issue-1350: multitablet filesystems api, follower configuration code, automatic session creation in followers, node creation in followers via requests to leader, service uts (#1420)

* issue-1350: multitablet filesystems api, follower configuration code, automatic session creation in followers, simple ut

* issue-1350: fixed ut - we need to always launch node-local tablet/ss/hive proxies

* issue-1350: making NodeIds and HandleIds different in different followers

* issue-1350: node creation in followers via a CreateNode or CreateHandle request to leader

* issue-1350: node creation by CreateHandle via the leader ut + fixes

* issue-1350: GetNodeAttr and ListNodes implementation in the leader

* issue-1350: address some review comments

* stabilizing endpoints_grpc/ut::ShouldHandleClientDisconnection (#1480)

* issue-1350: support for multitablet filesystems in StorageServiceActor (#1446)

* issue-1350: implemented AccessNode,SetNodeAttr,AllocateData,DestroyHandle,XAttr requests forwarding to followers

* issue-1350: implemented WriteData,ReadData-related requests forwarding to followers

* issue-1350: implemented two stage CreateHandle request handling - CreateHandle requests with ParentNodeId + Name should be first sent to leader, then modified and sent to the appropriate follower

* issue-1350: implemented CreateNode request forwarding to follower

* issue-1350: implemented GetNodeAttr request forwarding to follower

* issue-1350: fix

* issue-1350: and finally - ListNodes support for multitablet filesystems - first ListNodesRequest is sent to leader, then GetNodeAttrRequests are sent to the followers

* issue-1350: fixed MultiTabletForwardingEnabled for GetNodeAttr and ListNodes

* issue-1350: StorageServiceActor multitablet forwarding - addressed review comments

* issue-1350: added more multitablet uts, added multitablet fio_index test suite, made some fixes (#1482)

* issue-1350: added more multitablet uts, added multitablet fio_index test suite, made some fixes

* issue-1350: E_EXCLUSIVE flag should be unset in CreateHandle requests which are sent to followers

* fixed cmakelists and includes
qkrorlqr commented Jul 2, 2024

[Screenshot from 2024-07-02 21-18-30]

  • filestore containing 10+1 tablets
  • fio results for 60 clients x 32 numjobs x 1 iodepth
  • read: 128KiB x 560k IOPS = 70-75GB/s throughput
  • write: 1MiB x 20-25k IOPS = 20-25GB/s throughput (unstable: sometimes high vdisk latency percentiles for PutTabletLog spike and performance decreases)

debnatkh commented Jul 4, 2024

Same configuration:

  • filestore containing 10+1 tablets
  • 60 clients x 32 numjobs

mpirun + ior results:

  • mpirun --mca routed direct -H $NODES -np 1920 ior -o /root/mnt/ior_file_4 -t 1m -b 1G -F -C --posix.odirect
access    bw(MiB/s)  IOPS       Latency(s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ----       ----------  ---------- ---------  --------   --------   --------   --------   ----
write     33223      33226      0.043396    1048576    1024.00    0.671776   59.17      24.18      59.18      0   
read      68913      68937      0.026574    1048576    1024.00    0.136836   28.52      11.87      28.53      0   
  • mpirun --mca routed direct -H $NODES -np 1920 ior -o /root/mnt/ior_file_5 -t 128k -b 1G -F -C --posix.odirect
access    bw(MiB/s)  IOPS       Latency(s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ----       ----------  ---------- ---------  --------   --------   --------   --------   ----
write     5919       47351      0.038011    1048576    128.00     0.693598   332.17     137.91     332.18     0   
read      23958      191680     0.007246    1048576    128.00     0.097548   82.06      41.54      82.06      0   

qkrorlqr commented Jul 10, 2024

[Screenshot from 2024-07-10 11-44-20]
we reached up to 50GB/s write, 100GB/s read

qkrorlqr commented Jul 22, 2024

[Screenshot from 2024-07-22 20-32-34]

60 clients, 32 iodepth per client, 30 tablets
40-45GB/s write

qkrorlqr commented Aug 25, 2024

If a file is unlinked during listing, we may get an ENOENT error upon GetNodeAttr in ListNodesActor and will return E_IO to the client:

<< " listed invalid entry: name " << name.Quote()

https://github.com/ydb-platform/nbs/blob/main/cloud/filestore/libs/storage/service/service_actor_listnodes.cpp#L337 - this needs to be fixed asap, but in a slightly different way: upon getting this error we indeed need to remove this node from the response, but we also need to stat this node in the leader and check whether it was actually deleted (i.e. it doesn't exist anymore, or the name exists but points to a different followerId+followerName). A sketch of that check follows.
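
A sketch of the proposed check under assumed names (TLeaderRef etc. are invented for illustration): classify the ENOENT by re-statting the name in the leader and comparing the ref it returns against the one we listed.

```cpp
// Hypothetical sketch: decide what to do with a listed entry whose
// GetNodeAttr in the follower failed with ENOENT.
#include <cstdint>
#include <string>

enum class EEntryFate {
    DropFromResponse,   // the file was genuinely unlinked or its name was
                        // reused for another node during the listing
    RetryInFollower,    // leader still has the same ref: transient state,
                        // retry GetNodeAttr in the follower
};

struct TLeaderRef {
    bool Exists = false;
    uint64_t FollowerId = 0;
    std::string FollowerName;
};

// 'leaderRef' is the result of re-statting the name in the leader tablet.
EEntryFate ClassifyEnoent(
    const TLeaderRef& leaderRef,
    uint64_t listedFollowerId,
    const std::string& listedFollowerName)
{
    if (!leaderRef.Exists) {
        return EEntryFate::DropFromResponse;   // actually deleted
    }
    if (leaderRef.FollowerId != listedFollowerId ||
        leaderRef.FollowerName != listedFollowerName)
    {
        return EEntryFate::DropFromResponse;   // name points elsewhere now
    }
    return EEntryFate::RetryInFollower;        // same ref, follower lagging
}
```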

qkrorlqr commented

Will implement automatic creation of the shards based on FS size and will close this issue after that. Everything else is done.

qkrorlqr commented Sep 2, 2024

> Will implement automatic creation of the shards based on FS size and will close this issue after that. Everything else is done.

Decided to move this to a separate issue: #1932

@qkrorlqr qkrorlqr closed this as completed Sep 2, 2024