[Filestore] Multitablet filesystems #1350

Closed
qkrorlqr opened this issue Jun 5, 2024 · 7 comments
qkrorlqr commented Jun 5, 2024

Right now one FS == one IndexTablet, which is a bottleneck for:

  • fetching and updating block layer index info upon read and write requests - severely limits our read/write IOPS
  • fetching and updating byte layer data upon unaligned read and write requests
  • processing file metadata requests like open/close/stat/create/delete

It's also a limiting factor for max FS size, because a single tablet needs to be able to store the whole block layer index, and this index can't grow arbitrarily large. In fact, handling a 100+TiB FS is a challenge already.

We need to be able to provide linear scalability for FS size and for single-file-level ops:

  • any kind of file IO
  • open/close/stat calls

The suggested solution is to make N + 1 tablets for a single FS, where N will be determined from the FS size upon FS creation. First versions may even require manual creation of the additional N tablets. One tablet would store the directory inodes and node refs. The refs which point to file inodes would point to the other N tablets, which would store all files directly under their roots; the names of those node refs in the file tablets' root directories may simply be guids (see the sketch below).
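
To make the layout concrete, here is a minimal sketch of what such a node ref could look like. All type and field names are illustrative, not the actual NBS types:

```cpp
// Hypothetical sketch: a node ref in the directory tablet either points to a
// local inode (e.g. a directory) or to a file inode living in one of the N
// file tablets, addressed by <tabletId, nodeName>, where nodeName is a guid
// in the file tablet's root directory.
#include <cstdint>
#include <string>
#include <variant>

struct TLocalRef {
    uint64_t NodeId = 0;            // inode id inside the same tablet
};

struct TExternalRef {
    uint64_t ShardTabletId = 0;     // one of the N file tablets
    std::string ShardNodeName;      // e.g. a guid in that tablet's root
};

struct TNodeRef {
    uint64_t ParentNodeId = 0;      // directory inode in the directory tablet
    std::string Name;               // user-visible file name
    std::variant<TLocalRef, TExternalRef> Target;
};
```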

Technically, file creation and deletion would become multi-tablet transactions, but that won't be hard to implement: we don't need a full 2PC here, and we can also keep a cache of pre-created 0-size files so that creation requests can be served without a multi-tablet transaction. Deletion can be served asynchronously: after we delete the last node ref (and there are no open file handles), the client can no longer find the file, so there is no need for a real synchronous multi-tablet transaction here either. Again, the first version doesn't need those optimizations and can simply handle multi-tablet transactions in the following way (a sketch follows the list):

  1. in addition to modifying the node refs table, we should also add an entry describing the requested op (create/delete node) to a log table
  2. as long as there are op entries in the log table, we should repeatedly retry those ops; as soon as an op completes, its entry should be deleted from that table
  3. the client (TStorageServiceActor) should return a proper error code if it sees a mismatch between a node ref and a node (e.g. a ref to a not-yet-created node should be treated as if the file doesn't exist - ENOENT)
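
A minimal sketch of that op-log loop, with invented types; the real implementation would persist the log in a tablet-local table and drive retries from an actor:

```cpp
// Hypothetical sketch of the op-log approach: the directory tablet persists
// the intended op in the same local transaction that modifies the node refs
// table, then a background loop retries the op against the file tablet until
// it is acknowledged. A crash before the ack simply causes another retry.
#include <cstdint>
#include <deque>
#include <string>

enum class EOp { CreateNode, UnlinkNode };

struct TOpLogEntry {
    uint64_t OpId = 0;
    EOp Op = EOp::CreateNode;
    uint64_t ShardTabletId = 0;     // target file tablet
    std::string ShardNodeName;      // guid of the node in that tablet's root
};

class TOpLog {
public:
    // Called in the same local transaction that modifies the node refs table.
    void Append(TOpLogEntry entry) { Pending.push_back(std::move(entry)); }

    // Background retry loop: an op stays in the log until the file tablet
    // acknowledges it.
    template <typename TSendToShard>
    void RetryPending(TSendToShard&& send) {
        for (auto it = Pending.begin(); it != Pending.end();) {
            if (send(*it)) {            // true == acked, op is durable there
                it = Pending.erase(it); // only now delete the log entry
            } else {
                ++it;                   // keep it, retry on the next pass
            }
        }
    }

private:
    std::deque<TOpLogEntry> Pending;
};
```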

What needs to be done in the first version:

  1. support external node references - node refs may point to either node ids local to the tablet or pairs <tabletId, nodeName> for external node references
  2. support external inodes in TStorageServiceActor - it should be ready to receive a node ref in the form of <tabletId, nodeName> in response to requests which work with file inodes, and then perform a second request to that tabletId; a handleId -> <tabletId, nodeName (or nodeId)> cache should probably be implemented in the first version (see the cache sketch after this list)
  3. support tablet relations configuration - there should be an ability to tell a tablet to create file inodes in a bunch of other tablets
  4. support the ability to create/delete external inodes
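
For item 2, a rough sketch of the handleId -> <tabletId, nodeName> cache, with hypothetical names (the real TStorageServiceActor logic is more involved):

```cpp
// Illustrative sketch: after the directory tablet answers with an external
// ref, the storage service caches it by handle id so subsequent I/O on that
// handle goes straight to the file tablet without re-asking the directory
// tablet. All names here are invented for illustration.
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>

struct TShardLocation {
    uint64_t ShardTabletId = 0;
    std::string ShardNodeName;      // or a shard-local node id once known
};

class THandleLocationCache {
public:
    void Put(uint64_t handleId, TShardLocation loc) {
        Cache[handleId] = std::move(loc);
    }

    std::optional<TShardLocation> Get(uint64_t handleId) const {
        if (auto it = Cache.find(handleId); it != Cache.end()) {
            return it->second;
        }
        return std::nullopt;        // miss: ask the directory tablet again
    }

    void Forget(uint64_t handleId) {
        Cache.erase(handleId);      // e.g. on DestroyHandle
    }

private:
    std::unordered_map<uint64_t, TShardLocation> Cache;
};
```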

We also need to properly track sessions in slave tablets. I think the easiest way to do that is to have the master tablet create the sessions in the slave tablets upon session creation in the master tablet. The master tablet is then responsible for pinging the slave tablet sessions.
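
A hedged sketch of that session fan-out, again with invented interfaces: the master mirrors session creation into the slave tablets and keeps the slave sessions alive with its own pings.

```cpp
// Hypothetical sketch, not the actual NBS session code: the master tablet
// fans session creation out to all slave tablets and periodically pings the
// mirrored sessions so they don't expire while the client session lives.
#include <string>
#include <vector>

struct ISlaveTabletClient {
    virtual ~ISlaveTabletClient() = default;
    virtual void CreateSession(const std::string& sessionId) = 0;
    virtual void PingSession(const std::string& sessionId) = 0;
};

class TMasterSessionManager {
public:
    explicit TMasterSessionManager(std::vector<ISlaveTabletClient*> slaves)
        : Slaves(std::move(slaves))
    {}

    // Mirror session creation into every slave tablet.
    void OnSessionCreated(const std::string& sessionId) {
        for (auto* slave : Slaves) {
            slave->CreateSession(sessionId);
        }
        Sessions.push_back(sessionId);
    }

    // The master owns keep-alive: one timer tick pings every mirrored session.
    void OnPingTimer() {
        for (const auto& sessionId : Sessions) {
            for (auto* slave : Slaves) {
                slave->PingSession(sessionId);
            }
        }
    }

private:
    std::vector<ISlaveTabletClient*> Slaves;
    std::vector<std::string> Sessions;
};
```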

@qkrorlqr qkrorlqr added filestore Add this label to run only cloud/filestore build and tests on PR 2024Q2 labels Jun 5, 2024
@qkrorlqr qkrorlqr self-assigned this Jun 5, 2024
qkrorlqr added a commit that referenced this issue Jun 23, 2024
…a minor blockstore test fix (#1483)

* issue-1350: multitablet filesystems api, follower configuration code, automatic session creation in followers, node creation in followers via requests to leader, service uts (#1420)

* issue-1350: multitablet filesystems api, follower configuration code, automatic session creation in followers, simple ut

* issue-1350: fixed ut - we need to always launch node-local tablet/ss/hive proxies

* issue-1350: making NodeIds and HandleIds different in different followers

* issue-1350: node creation in followers via a CreateNode or CreateHandle request to leader

* issue-1350: node creation by CreateHandle via the leader ut + fixes

* issue-1350: GetNodeAttr and ListNodes implementation in the leader

* issue-1350: address some review comments

* stabilizing endpoints_grpc/ut::ShouldHandleClientDisconnection (#1480)

* issue-1350: support for multitablet filesystems in StorageServiceActor (#1446)

* issue-1350: implemented AccessNode,SetNodeAttr,AllocateData,DestroyHandle,XAttr requests forwarding to followers

* issue-1350: implemented WriteData,ReadData-related requests forwarding to followers

* issue-1350: implemented two stage CreateHandle request handling - CreateHandle requests with ParentNodeId + Name should be first sent to leader, then modified and sent to the appropriate follower

* issue-1350: implemented CreateNode request forwarding to follower

* issue-1350: implemented GetNodeAttr request forwarding to follower

* issue-1350: fix

* issue-1350: and finally - ListNodes support for multitablet filesystems - first ListNodesRequest is sent to leader, then GetNodeAttrRequests are sent to the followers

* issue-1350: fixed MultiTabletForwardingEnabled for GetNodeAttr and ListNodes

* issue-1350: StorageServiceActor multitablet forwarding - addressed review comments

* issue-1350: added more multitablet uts, added multitablet fio_index test suite, made some fixes (#1482)

* issue-1350: added more multitablet uts, added multitablet fio_index test suite, made some fixes

* issue-1350: E_EXCLUSIVE flag should be unset in CreateHandle requests which are sent to followers

* fixed cmakelists and includes
qkrorlqr commented Jul 2, 2024

[Screenshot from 2024-07-02 21-18-30]

  • filestore containing 10+1 tablets
  • fio results for 60 clients x 32 numjobs x 1 iodepth
  • read: 128KiB x 560k IOPS = 70-75GB/s throughput
  • write: 1MiB x 20-25k IOPS = 20-25GB/s throughput (unstable: sometimes high vdisk latency percentiles for PutTabletLog spike and performance decreases)

debnatkh commented Jul 4, 2024

Same configuration:

  • filestore containing 10+1 tablets
  • 60 clients x 32 numjobs

mpirun + ior results:

  • mpirun --mca routed direct -H $NODES -np 1920 ior -o /root/mnt/ior_file_4 -t 1m -b 1G -F -C --posix.odirect
access    bw(MiB/s)  IOPS       Latency(s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ----       ----------  ---------- ---------  --------   --------   --------   --------   ----
write     33223      33226      0.043396    1048576    1024.00    0.671776   59.17      24.18      59.18      0   
read      68913      68937      0.026574    1048576    1024.00    0.136836   28.52      11.87      28.53      0   
  • mpirun --mca routed direct -H $NODES -np 1920 ior -o /root/mnt/ior_file_5 -t 128k -b 1G -F -C --posix.odirect
access    bw(MiB/s)  IOPS       Latency(s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ----       ----------  ---------- ---------  --------   --------   --------   --------   ----
write     5919       47351      0.038011    1048576    128.00     0.693598   332.17     137.91     332.18     0   
read      23958      191680     0.007246    1048576    128.00     0.097548   82.06      41.54      82.06      0   

qkrorlqr commented Jul 10, 2024

[Screenshot from 2024-07-10 11-44-20]
we reached up to 50GB/s write, 100GB/s read

qkrorlqr commented Jul 22, 2024

[Screenshot from 2024-07-22 20-32-34]

60 clients, 32 iodepth per client, 30 tablets
40-45GB/s write

qkrorlqr commented Aug 25, 2024

If a file is unlinked during listing, we may get an ENOENT error upon GetNodeAttr in ListNodesActor and will return E_IO to the client:

<< " listed invalid entry: name " << name.Quote()

https://github.com/ydb-platform/nbs/blob/main/cloud/filestore/libs/storage/service/service_actor_listnodes.cpp#L337 - this needs to be fixed asap, but in a slightly different way: upon getting this error we indeed need to remove this node from the response, but we also need to stat this node in the leader and check whether it was actually deleted (i.e. it doesn't exist anymore, or the name exists but points to a different followerId+followerName). A sketch of that check follows.
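
A sketch of the proposed check under assumed names (TLeaderRef etc. are invented for illustration): classify the ENOENT by re-statting the name in the leader and comparing the ref it returns against the one we listed.

```cpp
// Hypothetical sketch: decide what to do with a listed entry whose
// GetNodeAttr in the follower failed with ENOENT.
#include <cstdint>
#include <string>

enum class EEntryFate {
    DropFromResponse,   // the file was genuinely unlinked or its name was
                        // reused for another node during the listing
    RetryInFollower,    // leader still has the same ref: transient state,
                        // retry GetNodeAttr in the follower
};

struct TLeaderRef {
    bool Exists = false;
    uint64_t FollowerId = 0;
    std::string FollowerName;
};

// 'leaderRef' is the result of re-statting the name in the leader tablet.
EEntryFate ClassifyEnoent(
    const TLeaderRef& leaderRef,
    uint64_t listedFollowerId,
    const std::string& listedFollowerName)
{
    if (!leaderRef.Exists) {
        return EEntryFate::DropFromResponse;   // actually deleted
    }
    if (leaderRef.FollowerId != listedFollowerId ||
        leaderRef.FollowerName != listedFollowerName)
    {
        return EEntryFate::DropFromResponse;   // name points elsewhere now
    }
    return EEntryFate::RetryInFollower;        // same ref, follower lagging
}
```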

qkrorlqr commented

Will implement automatic creation of the shards based on FS size and will close this issue after that. Everything else is done.

qkrorlqr commented Sep 2, 2024

> Will implement automatic creation of the shards based on FS size and will close this issue after that. Everything else is done.

Decided to move this to a separate issue: #1932

@qkrorlqr qkrorlqr closed this as completed Sep 2, 2024