[Filestore] non-blocking open/close

qkrorlqr commented 1 month ago

open (CreateHandle) and close (DestroyHandle) do the following expensive things (taking DestroyHandle as an example):

https://github.com/ydb-platform/nbs/blob/bc2f8e0966f19b88d9c3eb8fbc940d3a4d9b0871/cloud/filestore/libs/storage/tablet/tablet_actor_destroyhandle.cpp#L97 - deleting the handle from the session structure
https://github.com/ydb-platform/nbs/blob/bc2f8e0966f19b88d9c3eb8fbc940d3a4d9b0871/cloud/filestore/libs/storage/tablet/tablet_actor_destroyhandle.cpp#L107 - deleting the inode (if this handle was holding the last ref to this inode)

Both operations modify the index that is stored persistently in blobstorage. These modifications can be made in the background so that the corresponding syscall is acked to the guest quickly. The open call is more complex: I won't describe a solution for it right now (though I have some ideas). The close call looks simple to implement: we can ack the call straight away and run the cleanup (handle deletion and possibly node deletion) in the background. If the cleanup fails, it should be retried from the client. If the client dies, its session dies with it after an inactivity timeout and the cleanup happens automatically.
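A minimal sketch of the ack-first close described above. `FakeTablet`, `Client`, and their methods are hypothetical names for illustration only, not the real filestore API; the sketch also assumes that handle deletion is idempotent, so retrying it is safe.

```python
from collections import deque


class FakeTablet:
    """Hypothetical stand-in for the tablet; fails the first `failures` calls."""

    def __init__(self, failures=0):
        self.failures = failures
        self.handles = {1, 2}

    def destroy_handle(self, handle_id):
        if self.failures > 0:
            self.failures -= 1
            raise IOError("transient tablet error")
        # Assumed idempotent: deleting a missing handle is not an error.
        self.handles.discard(handle_id)


class Client:
    """Acks close() immediately and performs the cleanup later."""

    def __init__(self, tablet):
        self.tablet = tablet
        self.pending = deque()  # handles whose cleanup is not finished yet

    def close(self, handle_id):
        # Only record the work; the guest gets its ack right away.
        self.pending.append(handle_id)
        return 0

    def run_background_cleanup(self, max_attempts=10):
        """Retry pending cleanups; return True once nothing is pending."""
        attempts = 0
        while self.pending and attempts < max_attempts:
            attempts += 1
            handle_id = self.pending[0]
            try:
                self.tablet.destroy_handle(handle_id)
            except IOError:
                continue  # retry the same handle on the next iteration
            self.pending.popleft()
        return not self.pending
```

In this sketch `close()` returns before the tablet is contacted at all, which is exactly why a crash between the ack and the cleanup needs either a persistent record of the pending work or the session-timeout fallback.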

qkrorlqr commented 1 month ago

Currently we use the client's virtio queue as our redo log. If we ack the request before its actual completion, we need another redo log. We can create a log file on top of tmpfs local to filestore-vhost. open/close request processing may look like this:

1. Send CreateHandleStage1/DestroyHandleStage1 to the tablet, which performs basic checks against the tablet's in-memory state and returns a preliminary result (including HandleId for CreateHandle).
2. If the preliminary result is OK, write this result to the redo log.
3. Respond to the client.
4. Send CreateHandleStage2/DestroyHandleStage2 requests, which wait for the actual completion of the operation, and retry those requests upon errors.
5. Delete the corresponding op from the redo log upon Stage2 success.

If filestore-vhost restarts, it should reread its redo log and continue retrying the operations. Handles are local to the client, so no other client can interfere with them and cause non-retriable errors between Stage1 and Stage2.
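The flow above, including replay after a restart, could be sketched roughly as follows. Everything here is an assumption made for illustration: the JSON-lines log format, the `RedoLog`, `FakeTablet`, `open_file`, and `drive_stage2` names, and the use of a temporary directory in place of tmpfs. Stage2 is assumed to be idempotent so that replaying it is harmless.

```python
import json
import os
import tempfile


class RedoLog:
    """Append-only JSON-lines log (hypothetical format)."""

    def __init__(self, path):
        self.path = path

    def append(self, record):
        with open(self.path, "a") as f:
            f.write(json.dumps(record) + "\n")

    def pending(self):
        """Return ops that were begun but not yet marked done, in log order."""
        ops = {}
        if os.path.exists(self.path):
            with open(self.path) as f:
                for line in f:
                    rec = json.loads(line)
                    if rec["type"] == "begin":
                        ops[rec["op_id"]] = rec
                    else:
                        ops.pop(rec["op_id"], None)
        return list(ops.values())


def make_log():
    # A fresh temporary directory stands in for tmpfs in this sketch.
    return RedoLog(os.path.join(tempfile.mkdtemp(), "redo.log"))


class FakeTablet:
    """Hypothetical tablet with a cheap Stage1 and a fallible Stage2."""

    def __init__(self):
        self.next_handle = 100
        self.committed = set()
        self.fail_stage2 = 0  # number of Stage2 calls that should fail

    def create_handle_stage1(self, node_id):
        handle_id = self.next_handle
        self.next_handle += 1
        return {"ok": True, "handle_id": handle_id}

    def create_handle_stage2(self, handle_id):
        if self.fail_stage2 > 0:
            self.fail_stage2 -= 1
            raise IOError("transient tablet error")
        self.committed.add(handle_id)  # assumed idempotent


def open_file(tablet, log, op_id, node_id):
    """Steps 1-3: Stage1, log the preliminary result, respond to the guest."""
    result = tablet.create_handle_stage1(node_id)
    if not result["ok"]:
        return None
    log.append({"type": "begin", "op_id": op_id,
                "handle_id": result["handle_id"]})
    return result["handle_id"]


def drive_stage2(tablet, log, max_attempts=10):
    """Steps 4-5: retry Stage2 for every pending op, then mark it done."""
    for rec in log.pending():
        for _ in range(max_attempts):
            try:
                tablet.create_handle_stage2(rec["handle_id"])
            except IOError:
                continue
            log.append({"type": "done", "op_id": rec["op_id"]})
            break
```

After a restart, constructing a new `RedoLog` over the same path and calling `drive_stage2` resumes whatever was left unfinished. The sketch appends a "done" record rather than deleting the entry, which is one possible way to implement step 5 on an append-only file.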

Notes:

Stage1 and Stage2 remotely resemble the prepare and commit stages of 2PC.
If we receive an IO op for a HandleId before the corresponding Stage2 completes, we should treat E_FS_BADHANDLE (or EBADF) as a retriable error.
Stage1 can trigger Stage2 inside the tablet automatically. Explicit Stage2 requests from filestore-vhost will hit the tablet DupCache in this case.
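The second note could look roughly like this on the filestore-vhost side. The `is_retriable` and `do_io` helpers, the `pending_stage2` set, and the `send`/`on_wait` callbacks are hypothetical; the string value of `E_FS_BADHANDLE` is a placeholder, not the real error code.

```python
# Placeholder value; only the name comes from the discussion above.
E_FS_BADHANDLE = "E_FS_BADHANDLE"


def is_retriable(error, handle_id, pending_stage2):
    """A bad-handle error is retriable only while Stage2 is still pending."""
    return error == E_FS_BADHANDLE and handle_id in pending_stage2


def do_io(send, handle_id, pending_stage2, on_wait, max_attempts=5):
    """Send an IO op, retrying bad-handle errors for not-yet-committed handles.

    `send(handle_id)` returns None on success or an error code.
    `on_wait()` is called between attempts (for example, to back off).
    """
    error = None
    for _ in range(max_attempts):
        error = send(handle_id)
        if error is None:
            return None
        if not is_retriable(error, handle_id, pending_stage2):
            return error
        on_wait()
    return error
```

Once a handle leaves `pending_stage2`, the same error is returned to the caller unchanged, so genuinely invalid handles still fail fast.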

qkrorlqr commented 1 week ago

If the client is opening a new file with the O_CREAT flag, we can do that asynchronously as well (after checking that it's actually possible to open the file) and queue any writes that arrive before the file has been created. If a race lets the initial check pass but then prevents us from creating the file, we can respond to those writes with an I/O error. This is not a universally correct solution, of course, so we can keep this logic behind a feature flag.
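A rough sketch of the write queueing for an asynchronous O_CREAT. The `AsyncCreate` class and its states are hypothetical and the file contents are kept in memory purely for illustration; the sketch assumes that a failed creation maps every queued write to `EIO`.

```python
import errno


class AsyncCreate:
    """Tracks a file whose creation is still in flight (hypothetical)."""

    def __init__(self):
        self.state = "creating"  # becomes "created" or "failed"
        self.queued = []         # (offset, data) writes awaiting creation
        self.data = bytearray()  # in-memory stand-in for the file contents

    def _apply(self, offset, buf):
        end = offset + len(buf)
        if len(self.data) < end:
            self.data.extend(b"\0" * (end - len(self.data)))
        self.data[offset:end] = buf

    def write(self, offset, buf):
        """Return bytes written, -EIO, or None if the result is deferred."""
        if self.state == "creating":
            self.queued.append((offset, buf))
            return None
        if self.state == "failed":
            return -errno.EIO
        self._apply(offset, buf)
        return len(buf)

    def on_create_finished(self, ok):
        """Flush or fail the queued writes; return one result per write."""
        self.state = "created" if ok else "failed"
        results = []
        for offset, buf in self.queued:
            if ok:
                self._apply(offset, buf)
                results.append(len(buf))
            else:
                results.append(-errno.EIO)
        self.queued.clear()
        return results
```

Deferring the write results until `on_create_finished` is one possible choice. Acking queued writes earlier would need the same kind of redo log as the open/close path.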

[Filestore] non-blocking open/close #1541