qkrorlqr opened 1 month ago
filestore containing 10+1 tablets
fio results for 60 clients x 32 numjobs x 1 iodepth
read: 128KiB x 560k IOPS = 70-75GB/s throughput
write: 1MiB x 20-25k IOPS = 20-25GB/s throughput (unstable: vdisk latency percentiles for PutTabletLog sometimes spike and performance decreases)
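As a sanity check, the throughput figures above follow directly from IOPS x block size:

```python
# Verify that the fio throughput numbers are consistent with IOPS x block size.
KiB = 1024
MiB = 1024 * KiB

read_bw = 560_000 * 128 * KiB / 1e9    # 128KiB blocks at 560k IOPS
write_bw_low = 20_000 * MiB / 1e9      # 1MiB blocks at 20k IOPS
write_bw_high = 25_000 * MiB / 1e9     # 1MiB blocks at 25k IOPS

print(f"read:  {read_bw:.1f} GB/s")                            # ~73.4, within 70-75 GB/s
print(f"write: {write_bw_low:.1f}-{write_bw_high:.1f} GB/s")   # ~21.0-26.2 GB/s
```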
Same configuration:
mpirun + ior results:
```
mpirun --mca routed direct -H $NODES -np 1920 ior -o /root/mnt/ior_file_4 -t 1m -b 1G -F -C --posix.odirect

access bw(MiB/s)  IOPS   Latency(s)  block(KiB)  xfer(KiB)  open(s)   wr/rd(s)  close(s)  total(s)  iter
------ ---------  ----   ----------  ----------  ---------  --------  --------  --------  --------  ----
write  33223      33226  0.043396    1048576     1024.00    0.671776  59.17     24.18     59.18     0
read   68913      68937  0.026574    1048576     1024.00    0.136836  28.52     11.87     28.53     0
```
```
mpirun --mca routed direct -H $NODES -np 1920 ior -o /root/mnt/ior_file_5 -t 128k -b 1G -F -C --posix.odirect

access bw(MiB/s)  IOPS    Latency(s)  block(KiB)  xfer(KiB)  open(s)   wr/rd(s)  close(s)  total(s)  iter
------ ---------  ----    ----------  ----------  ---------  --------  --------  --------  --------  ----
write  5919       47351   0.038011    1048576     128.00     0.693598  332.17    137.91    332.18    0
read   23958      191680  0.007246    1048576     128.00     0.097548  82.06     41.54     82.06     0
```
Right now one FS == one IndexTablet which is a bottleneck for:
It's also a limiting factor for the maximum FS size, because a single tablet needs to be able to store the whole block-layer index, and this index can't grow too big. In fact, handling a 100+TiB FS is already a challenge.
We need to be able to provide linear scalability for FS size and single file-level ops:
The suggested solution is to make N + 1 tablets for a single FS, where N is determined from the FS size upon FS creation. First versions may even require manual creation of the additional N tablets. One tablet would store the directory inodes and node refs. The refs which point to file inodes would point into the other N tablets, which would store all files directly under their roots; the names of the node refs in those root directories may simply be guids.
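A minimal sketch of this layout, assuming hypothetical names (the master tablet keeps directory inodes and node refs; each ref to a file inode carries a shard tablet id and a guid; the shard stores the file directly under its root with the guid as its name; hashing the guid is just one possible shard-assignment scheme):

```python
# Hypothetical sketch of the proposed N + 1 tablet layout. All names are
# illustrative, not the actual filestore API.
import hashlib
import uuid

N_SHARDS = 4  # N would be determined from the FS size upon FS creation

class ShardTablet:
    def __init__(self, tablet_id):
        self.tablet_id = tablet_id
        self.files = {}  # guid -> file inode, stored directly under the shard root

    def create_file(self, guid):
        self.files[guid] = {"size": 0, "blocks": []}

class MasterTablet:
    def __init__(self, shards):
        self.shards = shards
        self.node_refs = {}  # (parent_dir, name) -> {shard, guid}

    def pick_shard(self, guid):
        # any stable assignment works; hashing the guid is one option
        h = int(hashlib.sha1(guid.encode()).hexdigest(), 16)
        return self.shards[h % len(self.shards)]

    def create_file(self, parent_dir, name):
        guid = str(uuid.uuid4())
        shard = self.pick_shard(guid)
        shard.create_file(guid)  # the file inode lives in the shard tablet
        self.node_refs[(parent_dir, name)] = {
            "shard": shard.tablet_id,  # the ref points into one of the N tablets
            "guid": guid,              # the file's name in the shard's root
        }
        return guid

fs = MasterTablet([ShardTablet(i) for i in range(N_SHARDS)])
guid = fs.create_file("/", "data.bin")
ref = fs.node_refs[("/", "data.bin")]
print(ref["shard"], ref["guid"] == guid)  # the ref resolves to a shard + guid
```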
Technically, file creation and deletion would cause a multi-tablet transaction, but it won't be hard to implement - we don't need full 2PC here, and we can also keep a cache of pre-created 0-size files to be able to serve creation requests without a multi-tablet transaction. Deletion can be served asynchronously: the client wouldn't be able to find a file after we delete its last node ref (assuming there are no open file handles), so there is no need for a real synchronous multi-tablet transaction here either. Again, the first version doesn't need these optimizations and can simply handle multi-tablet transactions in the following way:
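The two optimizations above can be sketched as follows (a hedged illustration, not the actual implementation; all names are hypothetical):

```python
# Sketch of (1) serving file creation from a cache of pre-created 0-size
# files so only the master's node ref table changes, and (2) deferring the
# shard-side part of deletion so unlink is a single-tablet operation too.
from collections import deque

class ShardedFs:
    def __init__(self):
        self.precreated = deque()   # (shard_id, guid) of ready 0-size files
        self.node_refs = {}         # name -> (shard_id, guid)
        self.pending_unlinks = []   # shard-side cleanup to apply asynchronously

    def refill(self, entries):
        # background job: pre-create 0-size files in shard tablets ahead of time
        self.precreated.extend(entries)

    def create(self, name):
        if not self.precreated:
            raise RuntimeError("cache empty: fall back to a multi-tablet create")
        # single-tablet operation: just link a pre-created file into the namespace
        self.node_refs[name] = self.precreated.popleft()

    def unlink(self, name):
        # removing the last node ref makes the file invisible to clients;
        # the shard-side inode can be garbage-collected later
        self.pending_unlinks.append(self.node_refs.pop(name))

fs = ShardedFs()
fs.refill([(0, "guid-a"), (1, "guid-b")])
fs.create("file1")
fs.unlink("file1")
print(fs.pending_unlinks)  # [(0, 'guid-a')] - cleanup happens asynchronously
```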
What needs to be done in the first version:
We also need to properly track sessions in slave tablets. I think the easiest way to do that is to have the master tablet create the sessions in the slave tablets upon session creation in the master tablet. The master tablet is then responsible for pinging slave tablet sessions.
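The session-tracking idea above could look roughly like this (an illustrative sketch with made-up names: the master mirrors each new session into every slave and forwards pings so slave-side sessions don't time out):

```python
# Hypothetical sketch of master-driven session tracking across slave tablets.
class SlaveTablet:
    def __init__(self):
        self.sessions = {}  # session_id -> last ping time (logical clock)

    def create_session(self, session_id, now):
        self.sessions[session_id] = now

    def ping_session(self, session_id, now):
        self.sessions[session_id] = now

class MasterTablet:
    def __init__(self, slaves):
        self.slaves = slaves
        self.sessions = set()

    def create_session(self, session_id, now=0):
        self.sessions.add(session_id)
        # mirror the session into every slave when it is created in the master
        for slave in self.slaves:
            slave.create_session(session_id, now)

    def on_client_ping(self, session_id, now):
        # the master pings slave sessions on the client's behalf
        for slave in self.slaves:
            slave.ping_session(session_id, now)

slaves = [SlaveTablet(), SlaveTablet()]
master = MasterTablet(slaves)
master.create_session("sess-1")
master.on_client_ping("sess-1", now=5)
print(all(s.sessions["sess-1"] == 5 for s in slaves))  # True
```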