qkrorlqr opened 1 month ago
filestore containing 10+1 tablets
fio results for 60 clients x 32 numjobs x 1 iodepth
read: 128KiB x 560k IOPS = 70-75GB/s throughput
write: 1MiB x 20-25k IOPS = 20-25GB/s throughput (unstable: vdisk latency percentiles for PutTabletLog sometimes spike and performance decreases)
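As a sanity check, the throughput figures above follow directly from IOPS x block size:

```python
# Verify that the fio throughput numbers are consistent with IOPS x block size.
KiB = 1024
MiB = 1024 * KiB

read_bw = 560_000 * 128 * KiB / 1e9    # 128KiB blocks at 560k IOPS
write_bw_low = 20_000 * MiB / 1e9      # 1MiB blocks at 20k IOPS
write_bw_high = 25_000 * MiB / 1e9     # 1MiB blocks at 25k IOPS

print(f"read:  {read_bw:.1f} GB/s")                            # ~73.4, within 70-75 GB/s
print(f"write: {write_bw_low:.1f}-{write_bw_high:.1f} GB/s")   # ~21.0-26.2 GB/s
```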
Same configuration:
mpirun + ior results:
```
mpirun --mca routed direct -H $NODES -np 1920 ior -o /root/mnt/ior_file_4 -t 1m -b 1G -F -C --posix.odirect

access bw(MiB/s)  IOPS   Latency(s)  block(KiB)  xfer(KiB)  open(s)   wr/rd(s)  close(s)  total(s)  iter
------ ---------  ----   ----------  ----------  ---------  --------  --------  --------  --------  ----
write  33223      33226  0.043396    1048576     1024.00    0.671776  59.17     24.18     59.18     0
read   68913      68937  0.026574    1048576     1024.00    0.136836  28.52     11.87     28.53     0
```
```
mpirun --mca routed direct -H $NODES -np 1920 ior -o /root/mnt/ior_file_5 -t 128k -b 1G -F -C --posix.odirect

access bw(MiB/s)  IOPS    Latency(s)  block(KiB)  xfer(KiB)  open(s)   wr/rd(s)  close(s)  total(s)  iter
------ ---------  ----    ----------  ----------  ---------  --------  --------  --------  --------  ----
write  5919       47351   0.038011    1048576     128.00     0.693598  332.17    137.91    332.18    0
read   23958      191680  0.007246    1048576     128.00     0.097548  82.06     41.54     82.06     0
```
Right now one FS == one IndexTablet which is a bottleneck for:
It's also a limiting factor for the maximum FS size, because a single tablet needs to be able to store the whole block-layer index, and this index can't grow too big. In fact, handling a 100+TiB FS is already a challenge.
We need to be able to provide linear scalability for FS size and single file-level ops:
The suggested solution is to make N + 1 tablets for a single FS, where N is determined from the FS size upon FS creation. First versions may even require manual creation of the additional N tablets. One tablet would store the directory inodes and node refs. The refs which point to file inodes would point into the other N tablets, which would store all files directly under their roots; the names of the node refs in those root directories may simply be guids.
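A minimal sketch of this layout, assuming hypothetical names (the master tablet keeps directory inodes and node refs; each ref to a file inode carries a shard tablet id and a guid; the shard stores the file directly under its root with the guid as its name; hashing the guid is just one possible shard-assignment scheme):

```python
# Hypothetical sketch of the proposed N + 1 tablet layout. All names are
# illustrative, not the actual filestore API.
import hashlib
import uuid

N_SHARDS = 4  # N would be determined from the FS size upon FS creation

class ShardTablet:
    def __init__(self, tablet_id):
        self.tablet_id = tablet_id
        self.files = {}  # guid -> file inode, stored directly under the shard root

    def create_file(self, guid):
        self.files[guid] = {"size": 0, "blocks": []}

class MasterTablet:
    def __init__(self, shards):
        self.shards = shards
        self.node_refs = {}  # (parent_dir, name) -> {shard, guid}

    def pick_shard(self, guid):
        # any stable assignment works; hashing the guid is one option
        h = int(hashlib.sha1(guid.encode()).hexdigest(), 16)
        return self.shards[h % len(self.shards)]

    def create_file(self, parent_dir, name):
        guid = str(uuid.uuid4())
        shard = self.pick_shard(guid)
        shard.create_file(guid)  # the file inode lives in the shard tablet
        self.node_refs[(parent_dir, name)] = {
            "shard": shard.tablet_id,  # the ref points into one of the N tablets
            "guid": guid,              # the file's name in the shard's root
        }
        return guid

fs = MasterTablet([ShardTablet(i) for i in range(N_SHARDS)])
guid = fs.create_file("/", "data.bin")
ref = fs.node_refs[("/", "data.bin")]
print(ref["shard"], ref["guid"] == guid)  # the ref resolves to a shard + guid
```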
Technically, file creation and deletion would cause a multi-tablet transaction, but it won't be hard to implement - we don't need full 2PC here, and we can also keep a cache of pre-created 0-size files to be able to serve creation requests without a multi-tablet transaction. Deletion can be served asynchronously: the client wouldn't be able to find a file after we delete its last node ref (assuming there are no open file handles), so there is no need for a real synchronous multi-tablet transaction here either. Again, the first version doesn't need these optimizations and can simply handle multi-tablet transactions in the following way:
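The two optimizations above can be sketched as follows (a hedged illustration, not the actual implementation; all names are hypothetical):

```python
# Sketch of (1) serving file creation from a cache of pre-created 0-size
# files so only the master's node ref table changes, and (2) deferring the
# shard-side part of deletion so unlink is a single-tablet operation too.
from collections import deque

class ShardedFs:
    def __init__(self):
        self.precreated = deque()   # (shard_id, guid) of ready 0-size files
        self.node_refs = {}         # name -> (shard_id, guid)
        self.pending_unlinks = []   # shard-side cleanup to apply asynchronously

    def refill(self, entries):
        # background job: pre-create 0-size files in shard tablets ahead of time
        self.precreated.extend(entries)

    def create(self, name):
        if not self.precreated:
            raise RuntimeError("cache empty: fall back to a multi-tablet create")
        # single-tablet operation: just link a pre-created file into the namespace
        self.node_refs[name] = self.precreated.popleft()

    def unlink(self, name):
        # removing the last node ref makes the file invisible to clients;
        # the shard-side inode can be garbage-collected later
        self.pending_unlinks.append(self.node_refs.pop(name))

fs = ShardedFs()
fs.refill([(0, "guid-a"), (1, "guid-b")])
fs.create("file1")
fs.unlink("file1")
print(fs.pending_unlinks)  # [(0, 'guid-a')] - cleanup happens asynchronously
```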
What needs to be done in the first version:
We also need to properly track sessions in slave tablets. I think the easiest way to do that is to have the master tablet create the sessions in the slave tablets upon session creation in the master tablet. The master tablet is then responsible for pinging slave tablet sessions.
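The session-tracking idea above could look roughly like this (an illustrative sketch with made-up names: the master mirrors each new session into every slave and forwards pings so slave-side sessions don't time out):

```python
# Hypothetical sketch of master-driven session tracking across slave tablets.
class SlaveTablet:
    def __init__(self):
        self.sessions = {}  # session_id -> last ping time (logical clock)

    def create_session(self, session_id, now):
        self.sessions[session_id] = now

    def ping_session(self, session_id, now):
        self.sessions[session_id] = now

class MasterTablet:
    def __init__(self, slaves):
        self.slaves = slaves
        self.sessions = set()

    def create_session(self, session_id, now=0):
        self.sessions.add(session_id)
        # mirror the session into every slave when it is created in the master
        for slave in self.slaves:
            slave.create_session(session_id, now)

    def on_client_ping(self, session_id, now):
        # the master pings slave sessions on the client's behalf
        for slave in self.slaves:
            slave.ping_session(session_id, now)

slaves = [SlaveTablet(), SlaveTablet()]
master = MasterTablet(slaves)
master.create_session("sess-1")
master.on_client_ping("sess-1", now=5)
print(all(s.sessions["sess-1"] == 5 for s in slaves))  # True
```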