zbjornson / node-io-uring

Prototype using io_uring for true async fs I/O in libuv/Node.js
41 stars 4 forks source link

Support "current position" (position=null) writes, reads #1

Open innerop opened 5 years ago

innerop commented 5 years ago

Hi

Awesome that you made this.

I’d like to do a benchmark based on your work for fileAppend and wondering if I can supply nulls for offset in position in writeBuffer so it just appends the data at end of file.

Is that supported?

If not, I can take a stab at it, although my knowledge of C++ is very rusty.

Thanks for sharing!

zbjornson commented 5 years ago

Hey, thanks for the interest.

null can't be passed yet (see TODO here: https://github.com/zbjornson/node-io-uring/blob/e28d3543e3d17f4e9f01b73b0b19bd42f85cb18d/iou.cc#L115). I keep meaning to ask @axboe if non-positional (current position) file read/writes are going to be possible in the future as io_uring doesn't support it right now. Until then, I think Node would have to use lseek(2) to get and update the file position, but that makes it a non-atomic operation.

axboe commented 5 years ago

That doesn't work so well for async interfaces, so no, there will not be an option to use 'current' position.

idibidiart commented 5 years ago

@zbjornson

For my test, there's no possibility of concurrent writes to the file, so not having it be atomic is fine. I'm thinking of getting current file size from fs.stat using that that as the position argument (current end of file) -- see any issues with that? (I'll isolate the time cost of fs.stat call)

Thanks once again, and excited to see how well it performs in append mode

zbjornson commented 5 years ago

@idibidiart yes, fs.stat would work if you're appending to the end of a file. Ultimately lseek(2) would probably be the correct way to do it (or just have Node/libuv maintain the current position internally).

Note that this module still requires a custom Linux kernel build. The required patch is now in the mainline kernel but hasn't been released yet. If you don't want to build your own kernel but can provide benchmarking code, I'm happy to run it for you.

idibidiart commented 5 years ago

@zbjornson having Node/libuv maintain the position in memory would imply statefulness between writes and if the connection is not persistent (e.g. http) the request (in a real life scenario) could go to any server (assuming a shared fs) so like you say lseek(2) should be used... Does that mean (for now and until node/libuv adopt uring) you could expose the lseek functionality in the bindings?

I've built Linux kernel for embedded systems in the past and it wasn't not bad at all. I'd like to benchmark this on both bare metal and an EC2 I3 instance. Having said that, please feel free to do an append mode benchmark if you're interested in that and don't wait on me since this is exclusively a curiosity driven activity on my end :) and not sure when I'll get to it, but I wanted to know that it is possible in the first place. Exposing lseek behavior in the bindings would be great (if it makes sense) but in the meantime I suppose fs.stat will do....

You may close this issue or keep it open, but I'll post here anyway if/when I have some results to share.

Thank you so much.

zbjornson commented 5 years ago

@idibidiart I'm happy to make the lseek(2) changes so current-position writes work. Can do tomorrow or Thursday probably.

If you're comfortable doing your own kernel build, by all means, go ahead :). No rush from my side, I wasn't planning to work much on this until another kernel release is coming up. (Before that, there's no point pulling it into libuv.)

idibidiart commented 5 years ago

@zbjornson

I've put together a demo (for work) that uses a chunk buffer (a chunk can be up to e.g. 32MB) where each chunk comes in 64 realtime-compressed slices over websocket, so I can send progress events back) then write it to fs via createWriteStream with default highWaterMark of 64KB. The realtime compression is via a C++ native module, using SoTA compression algorithm. Writing each chunk takes less time than getting all the compressed slices across the network, but to play it safe, I'm using createWriteStream to avoid invoking fs.write in none strictly serial manner. I should be able to use your writeBuffer once you have seek support or with fs.stat (although the latter will slow things down)

Here's a demo (sorry, the code belongs to employer but explanation above gives an idea of how it works, which is not novel by any chance, except maybe in the result:)

https://www.youtube.com/watch?v=dHOY1dnzViM

idibidiart commented 5 years ago

but I'd like to benchmark it first... :)

zbjornson commented 5 years ago

Node.js' fs.WriteStreams actually maintain their position internally and pass that to fs.write; they don't use current-position writes. https://github.com/nodejs/node/blob/03d43539f93be2c401f57d7eac87a5ae933ef905/lib/internal/fs/streams.js#L312 Using current-position writes wouldn't make fs.write safely append your buffers in sequence if you can't guarantee that you call fs.write serially.

You might be able to do something like require("fs").write = require("node-io-uring").writeBuffer to make WriteStreams use this module.

It's also safe to call writeBuffer multiple times without waiting for a prior one to complete. (I don't know why Node's fs.write claims to be unsafe for this purpose as long as you provide a file position and don't open the file with O_APPEND; pwrite(2) is thread-safe.)

So maybe something like this would work?

let position = 0;
client.onSlice = slice => {
  iou.writeBuffer(fd, slice, 0, slice.length, position, (err, bytesWritten) => {
    // Handle if err or bytesWritten< slice.length
    // Wait for all callbacks to be called --> done
  });
  position += slice.length;
};

You mentioned what sounds like multiple servers processing your data, but file descriptors and thus fd positions are local to each process, so that doesn't help your scenario, if I understand correctly; unless you're doing something fancy with sharing fds (and even then, I don't know if positions are shared).


(Side note: you might also benchmark using fs.writeSync instead of an fs.WriteStream. In the application I work on for work, that sped up overall performance nearly 2x. You still get some async work happening, as the kernel is able to read from the socket while the Node thread is blocked writing.)

zbjornson commented 5 years ago

I don't know why Node's fs.write claims to be unsafe for this purpose as long as you provide a file position and don't open the file with O_APPEND; pwrite(2) is thread-safe.

Found the answer: https://github.com/nodejs/node/issues/18634. It indeed sounds safe if you're always on POSIX.

idibidiart commented 5 years ago

Using current-position writes wouldn't make fs.write safely append your buffers in sequence if you can't guarantee that you call fs.write serially.

My mind has trouble understanding the rationale behind lseek. What use cases are there for moving the current read/write position statefully instead of just specifying the position when writing/reading?

I realize that createWriteStream keeps track of the position on JS side and even though I read the current file size from disk before I start calling WriteableStream.write on each connection it's possible that I end up with concurrent write from another server (e.g. after a disconnection/reconnection with no sticky session) However, when the socket disconnects the WriteableStream is ended immediately and it takes longer for the connection to reestablish (2 seconds delay) on a different server than for the WriteableStream to be closed, unless the event loop is overwhelmed.

But I am hoping the underlying fs.write is threadsafe (either writes the data or doesn't and won't write if another process is writing to the file.) Thanks for confirming that.

Does that mean your bindings.writeBuffer is thread safe, too? I assume it is, in which case I'll try what you suggested with the require swap. But first I gotta test your hint that writeSync is faster than WriteableStream.write. I'm buffering anyway.

Wow. Much to learn. Thank you.

idibidiart commented 5 years ago

I'll post the results of moving to writeSync and then move on to trying bindings.writeBuffer and keeping track of position in JS.

I really do not see benefit after this dialog of adding lseek for append (at current position) kind of behavior, but for compatibility with node's fs api I guess it'll have to be done.

idibidiart commented 5 years ago

Looks like the way I was using WriteableStream.write by having it write every 32MB chunk (aka buffer) instead of every 0.5MB slice-of-a-chunk (aka slice) was going at maximum speed (for one upload at a time -- I've yet to benchmark many concurrent uploads) Having it write every 0.5MB slice in rapid succession is super slow in comparison. I can look at the code for WriteableStream.write and tell why but that's a waste of time since it does what it does and I'm using it in optimal way.

So ... there was no difference in moving to writeSync in my case, nor was there any difference in moving to write (done safely, with position supplied explicitly in write mode)

But now that I'm down to fs.write() I can easily plugin your writeBuffer and see how much I gain for a single upload as well as for many concurrent uploads.

Update:

The link to the patch that you have in the Readme points to an invalid commit now. I did see that axboe is working on io_uring still, so it maybe wise to wait for the release of the new kernel, although like I said building the Linux kernel is always fun =)

innerop commented 5 years ago

~ 2GB in 2 minutes over 4G-speed upload bandwidth, uploading 10 files at once.

https://www.youtube.com/watch?v=EnSt4yf1K0c

This is using fs.write with a 32MB buffer

zbjornson commented 5 years ago

What use cases are there for moving the current read/write position statefully instead of just specifying the position when writing/reading?

I'm not sure of any concrete examples. Usually it's trivial to store a file's position along with the fd, so positional writes are the way to go. If you're doing unusual/fancy stuff with shared fds (dup(2)/dup2(2)/dup3(2) or forking) then maybe it's easier to let the kernel store the fd position than for you to sync the fd position across threads/processes.

Does that mean your bindings.writeBuffer is thread safe, too?

Yeah, but to be clear, you'll probably run into trouble if you try to write two different things to the same file position at the "same" time. For resumable uploads I assume you'd be writing the same data though.

using WriteableStream.write by having it write every 32MB chunk ... was going at maximum speed

If you're bottlenecked on actual I/O throughput (disk I/O -- or the rate at which the kernel can flush fs cache pages to disk) then io_uring isn't going to help (as far as I know -- there might be some performance differences at the actual I/O layer too that I'm not aware of). io_uring reduces the overhead of the write setup, which should be relatively small compared to the time it takes to write 32 MB.

writeSync instead of fs.WriteStream I think was primarily faster for us because Node.js' streams are pretty slow/high-overhead, but it also avoids the expense of waiting for an iteration of the event loop.

write(2) (and writeSync()) only blocks as long as it takes to copy your buffer into the kernel's fs page cache, which is usually very fast (~"not blocking"). Two exceptions to that are if you're doing direct I/O (unbuffered), or if the kernel has to free pages by flushing them to disk before it can copy your buffer. (Might be other exceptions but those are the common ones.)

The link to the patch that you have in the Readme points to an invalid commit now

It's in mainline now: https://github.com/torvalds/linux/commit/9b402849e80c85eee10bbd341aab3f1a0f942d4f#diff-a196e54ec8b5398427f9df3d2b074478. I'll update the readme with that.

idibidiart commented 5 years ago

@zbjornson

My analysis, which may be wrong, is that Node has only one event loop at a high level (not getting into macro/micro tasks, and aside from workers) and when that loop gets overfilled with events from the async stream write API things slow down considerably, so fewer async tasks means faster time to execution of each task. That's why calling stream write 64 times less has a great performance advantage (I hold a buffer of 32MB and I send buffer/64 sized slices to multiplex more efficiently over the underlying OS sockets.

I know the rabbit hole goes deep, but I'm working by benchmarking not so much by analysis =)

idibidiart commented 5 years ago

Btw, reducing the number of upload progress events by 3X led to speed up when many files are being uploaded/written by the same server by multiple worker threads on the client. It did not affect the case with a single upload and the effect is more noticeable the more files are uploaded/written concurrently. I would like to try io_uring and just see what the effect is. But I'll wait for support in libuv. Are the llbuv maintainers interested in adopting it?