ydb-platform / nbs

Network Block Store
Apache License 2.0

[Filestore] Multitablet filesystems #1350

Open qkrorlqr opened 1 month ago

qkrorlqr commented 1 month ago

Right now one FS == one IndexTablet which is a bottleneck for:

It's also a limiting factor for the maximum FS size, because a single tablet needs to be able to store the whole block layer index, and this index can't be too big. In fact, handling a 100+TiB FS is already a challenge.

We need to be able to provide linear scalability for FS size and single file-level ops:

The suggested solution is to make N + 1 tablets for a single FS, where N is determined from the FS size upon FS creation. The first versions may even require manual creation of the additional N tablets. One tablet would store the directory inodes and node refs. The refs which point to file inodes would point to the other N tablets, which would store all files directly under their roots; the names of the node refs in their root directories may simply be GUIDs.
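To make the proposed layout more concrete, here is a minimal in-memory sketch. All names (`ShardTablet`, `DirectoryTablet`, `make_filesystem`) are hypothetical and not part of the Filestore code, and the round-robin choice of the target tablet is an assumption; the proposal doesn't say how a tablet is picked.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class ShardTablet:
    """One of the N tablets: stores file inodes flat under its root."""
    tablet_id: int
    files: dict = field(default_factory=dict)  # guid name -> file size in bytes

    def create_file(self) -> str:
        name = str(uuid.uuid4())  # node ref names may simply be guids
        self.files[name] = 0
        return name


@dataclass
class DirectoryTablet:
    """The "+1" tablet: stores directory inodes and node refs only."""
    shards: list
    node_refs: dict = field(default_factory=dict)  # (parent, name) -> (tablet_id, guid)
    next_shard: int = 0

    def create_file(self, parent: str, name: str):
        # Assumption: round-robin placement across the N shard tablets.
        shard = self.shards[self.next_shard % len(self.shards)]
        self.next_shard += 1
        ref = (shard.tablet_id, shard.create_file())
        self.node_refs[(parent, name)] = ref
        return ref


def make_filesystem(n_shards: int):
    """Builds N shard tablets plus one directory tablet."""
    shards = [ShardTablet(tablet_id=i + 1) for i in range(n_shards)]
    return DirectoryTablet(shards=shards), shards
```

With `make_filesystem(3)`, six consecutive `create_file` calls would place two files in each shard tablet, while the directory tablet keeps only the `<tabletId, name>` refs.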

Technically, file creation and deletion would cause a multi-tablet transaction, but it won't be hard to implement - we don't need a full 2PC here and we can also keep a cache of pre-created 0-size files to be able to serve creation requests without a multi-tablet transaction. Deletion can be served asynchronously - the client wouldn't be able to find a file which was deleted after we delete its last node ref (if there are no open file handles) - so there is no need for a real synchronous multi-tablet transaction here either. Again, the first version doesn't need to have those optimizations and can simply handle multi-tablet transactions in the following way:

  1. in addition to modifying the node refs table we should also add an entry about the requested op (create/delete node) to a log table
  2. as long as we have op entries in the log table we should repeatedly retry those ops - as soon as an op completes, it should be deleted from that table.
  3. the client (TStorageServiceActor) should return a proper error code if it sees a mismatch between a node ref and a node (e.g. a ref to a not-yet-created node should be treated as if the file doesn't exist - ENOENT)
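The three steps above can be sketched roughly as follows. This is an in-memory illustration under stated assumptions, not the actual tablet code: the class and function names are hypothetical, a single shard tablet is used for brevity, and the ops on the shard are assumed to be idempotent so that retrying them is safe.

```python
import errno
from dataclasses import dataclass, field


@dataclass
class ShardTablet:
    nodes: set = field(default_factory=set)
    available: bool = True  # toggled to simulate an unreachable tablet

    def apply(self, op: str, node_name: str) -> bool:
        """Idempotent create/delete. Returns False if the tablet is unreachable."""
        if not self.available:
            return False
        if op == "create":
            self.nodes.add(node_name)
        else:
            self.nodes.discard(node_name)
        return True


@dataclass
class DirectoryTablet:
    shard: ShardTablet
    node_refs: dict = field(default_factory=dict)  # file name -> node name in shard
    op_log: list = field(default_factory=list)     # pending (op, node_name) entries

    def create_file(self, name: str, node_name: str):
        # Step 1: the node ref and the log entry are written together
        # (assumed to be one local transaction in a real tablet).
        self.node_refs[name] = node_name
        self.op_log.append(("create", node_name))

    def unlink(self, name: str):
        node_name = self.node_refs.pop(name)
        self.op_log.append(("delete", node_name))

    def retry_pending_ops(self):
        # Step 2: retry every logged op in order, keep only those that failed.
        self.op_log = [(op, n) for op, n in self.op_log
                       if not self.shard.apply(op, n)]


def get_attr(directory: DirectoryTablet, name: str) -> int:
    """Step 3: a ref to a not-yet-created node is reported as ENOENT."""
    node_name = directory.node_refs.get(name)
    if node_name is None or node_name not in directory.shard.nodes:
        return -errno.ENOENT
    return 0
```

In this sketch `retry_pending_ops` would be called periodically; a file becomes visible to `get_attr` only after its create op has completed on the shard and left the log.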

What needs to be done in the first version:

  1. support external node references - node refs may point to either node ids local to the tablet or pairs <tabletId, nodeName> for external node references
  2. support external inodes in TStorageServiceActor - in response to its requests which work with file inodes, it should be ready to receive a node ref in the form of <tabletId, nodeName> and perform a second request to tabletId. A handleId -> <tabletId, nodeName (or nodeId)> cache should be implemented in the first version, I suppose
  3. support tablet relations configuration - there should be an ability to tell a tablet to create file inodes in a bunch of other tablets
  4. support the ability to create/delete external inodes
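Items 1 and 2 could look roughly like this. The sketch is only an illustration: `LocalRef`, `ExternalRef`, `Tablet` and `StorageService` are hypothetical names (the latter standing in for TStorageServiceActor), and it assumes that an external ref always resolves to a local node in the target tablet, i.e. there is at most one level of indirection.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class LocalRef:
    node_id: int  # node id local to the tablet that owns the ref


@dataclass(frozen=True)
class ExternalRef:
    tablet_id: int
    node_name: str  # name of the node in the root of the other tablet


NodeRef = Union[LocalRef, ExternalRef]


class Tablet:
    def __init__(self, tablet_id: int):
        self.tablet_id = tablet_id
        self.refs = {}     # name -> NodeRef
        self.nodes = {}    # local node id -> attrs
        self.requests = 0  # counts lookups, for illustration only

    def lookup(self, name: str) -> NodeRef:
        self.requests += 1
        return self.refs[name]


class StorageService:
    """Hypothetical stand-in for TStorageServiceActor."""

    def __init__(self, tablets):
        self.tablets = {t.tablet_id: t for t in tablets}
        self.handle_cache = {}  # handle id -> (tablet_id, node_id)
        self.next_handle = 1

    def open(self, tablet_id: int, name: str) -> int:
        ref = self.tablets[tablet_id].lookup(name)
        if isinstance(ref, ExternalRef):
            # Second request, this time to the tablet that owns the inode.
            tablet_id = ref.tablet_id
            ref = self.tablets[tablet_id].lookup(ref.node_name)
        assert isinstance(ref, LocalRef)  # assumption: one level of indirection
        handle = self.next_handle
        self.next_handle += 1
        self.handle_cache[handle] = (tablet_id, ref.node_id)
        return handle

    def read_attrs(self, handle: int) -> dict:
        # The cache lets handle-based requests go straight to the right tablet.
        tablet_id, node_id = self.handle_cache[handle]
        return self.tablets[tablet_id].nodes[node_id]
```

The point of the handle cache here is that only `open` pays for the extra hop; later requests by handle don't need to resolve the ref again.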

We also need to properly track sessions in slave tablets. I think the easiest way to do that is for the master tablet to create the sessions in the slave tablets upon session creation in the master tablet. The master tablet is then responsible for pinging slave tablet sessions.
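A rough sketch of that session scheme is below. The names are hypothetical, and the timeout-based expiration in the slave tablets is an assumption added to show why the pings matter; the actual session lifetime rules are not described here.

```python
from dataclasses import dataclass, field


@dataclass
class SlaveTablet:
    sessions: dict = field(default_factory=dict)  # session id -> last ping time

    def create_session(self, session_id: str, now: float):
        self.sessions[session_id] = now

    def ping_session(self, session_id: str, now: float):
        self.sessions[session_id] = now

    def expire_sessions(self, now: float, timeout: float):
        # Assumption: sessions that were not pinged within `timeout` are dropped.
        self.sessions = {s: t for s, t in self.sessions.items()
                         if now - t <= timeout}


@dataclass
class MasterTablet:
    slaves: list
    sessions: set = field(default_factory=set)

    def create_session(self, session_id: str, now: float):
        # Creating a session in the master also creates it in every slave.
        self.sessions.add(session_id)
        for slave in self.slaves:
            slave.create_session(session_id, now)

    def ping_slaves(self, now: float):
        # The master keeps slave sessions alive on behalf of the client.
        for slave in self.slaves:
            for session_id in self.sessions:
                slave.ping_session(session_id, now)
```

The client then only needs to talk to the master tablet about sessions; the slave sessions stay alive as long as the master keeps pinging them.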

qkrorlqr commented 3 days ago

[Screenshot from 2024-07-02 21-18-30]

filestore containing 10+1 tablets, fio results for 60 clients x 32 numjobs x 1 iodepth:

read: 128KiB x 560k IOPS = 70-75GB/s throughput

write: 1MiB x 20-25k IOPS = 20-25GB/s throughput (unstable, sometimes high vdisk latency percentiles for PutTabletLog spike and performance decreases)
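For reference, the throughput figures follow from block size times IOPS. The small check below assumes that "k" means 1000 and that GB/s is decimal (10^9 bytes per second); neither is stated explicitly in the results.

```python
KIB = 1024
MIB = 1024 * KIB
GB = 10**9  # assumption: decimal gigabytes


def throughput_gb_per_s(block_bytes: int, iops: int) -> float:
    """Throughput in GB/s for a given block size and request rate."""
    return block_bytes * iops / GB


read = throughput_gb_per_s(128 * KIB, 560_000)
write_low = throughput_gb_per_s(1 * MIB, 20_000)
write_high = throughput_gb_per_s(1 * MIB, 25_000)

print(f"read:  {read:.1f} GB/s")
print(f"write: {write_low:.1f}-{write_high:.1f} GB/s")
```

Under these assumptions the read figure comes out at about 73.4 GB/s, inside the reported 70-75GB/s range, and the write figure at roughly 21-26 GB/s.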

debnatkh commented 1 day ago

Same configuration:

mpi run+ior results: