Slow performance writing compared to bazil.org/fuse

ncw commented 3 years ago

Hi!

I'm trying to track down a performance issue with cgofuse vs bazil.org/fuse in rclone.

Here is cgofuse mounting a local disk on Linux.

$ dd if=/tmp/1G of=/mnt/tmp/1G bs=128k
8192+0 records in
8192+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 18.2636 s, 58.8 MB/s

And here is the mount command. Note the 4k writes

rclone cmount -vv /tmp/data /mnt/tmp/ 2>&1 | grep "Write: "
...
2021/03/24 16:13:04 DEBUG : /1G: >Write: n=4096

And here is bazil.org/fuse

$ time dd if=/tmp/1G of=/mnt/tmp/1G bs=128k
8192+0 records in
8192+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 4.67968 s, 229 MB/s

And its mount command - note the 128k blocks

rclone mount -vv /tmp/data /mnt/tmp/ 2>&1 | grep "Write: "
...
2021/03/24 16:14:42 DEBUG : &{1G (w)}: >Write: written=131072, err=<nil>

I'd say this difference is entirely down to the different sized write blocks, but I haven't been able to change it. I've tried loads of fuse parameters (max_write, auto_cache) and I haven't been able to budge cgofuse from its 4k blocks for write.
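To put the block-size difference in numbers, here is a minimal Go sketch that counts how many write requests it takes to move the 1 GiB test file at each write size. The helper name requestCount is made up for illustration; it says nothing about per-request cost, only about how many round trips there are.

```go
package main

import "fmt"

// requestCount returns how many write requests are needed to move
// totalBytes when each request carries at most writeSize bytes.
// Hypothetical helper, for illustration only.
func requestCount(totalBytes, writeSize int64) int64 {
	return (totalBytes + writeSize - 1) / writeSize
}

func main() {
	const total = 1 << 30 // the 1 GiB file used in the dd runs above
	for _, size := range []int64{4096, 131072} {
		fmt.Printf("write size %6d -> %d requests\n", size, requestCount(total, size))
	}
}
```

With 4k writes the file needs 32 times as many requests as with 128k writes.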

Fuse mysteriously says this in the docs for max_write

max_write=N

Set the maximum number of bytes in a single write operation. The default is 128kbytes. Note, that due to various limitations, the size of write requests can be much smaller (4kbytes). This limitation will be removed in the future.

This was originally noted with macOS on the rclone forum but it replicated with linux for me very easily.

Any help much appreciated - running short of hair to pull out ;-)

(Testing done with rclone master, built with go build -tags cmount to include cgofuse support on Linux. Use --debug-fuse to see the fuse debug output.)

billziss-gh commented 3 years ago

Interesting. As cgofuse is really a thin shim over libfuse, this must be a misconfiguration of libfuse.

Configuration of the FUSE "connection" for the cgo version of cgofuse happens in hostAsgnCconninfo.

Since you are trying this on Linux my recommendation would be to test different values for max_write (and other fields) in the struct fuse_conn_info that you get in hostAsgnCconninfo. For example, you could try setting max_write as below and retry your benchmarks (untested):

conn->max_write = 128 * 1024;

There is also the possibility that libfuse2 (that cgofuse uses) does not have optimizations in libfuse3, in which case we may have to finally move cgofuse to support libfuse3. This is a much larger undertaking, because we would not want to break compatibility with systems that only have the FUSE2 API.

darthShadow commented 3 years ago

@ncw Adding -o direct_io to the rclone mount solves the issue for me on Linux.

ncw commented 3 years ago

direct_io sounds like the wrong thing to add with my reading of the docs.

https://libfuse.github.io/doxygen/structfuse__config.html#ae335bab50dfddef49b0ed81671066fa8

If you set the block size of the dd to 4k then it will change the write block size I think. I'll try this tomorrow.

However it shows that the block size is the problem.

darthShadow commented 3 years ago

It actually seems correct based on my usage of it in mergerfs too across multiple servers to get the best performance. The moment we introduce kernel caching (by not adding direct_io), the performance drops down as it essentially double-caches the file with no control over how fast the kernel sends it to the application. Admittedly, I don't know much about the internals of fuse, other than what I have learnt from my testing, so there may be some other explanation too.

Without direct_io:

darthshadow@server:~/test$ rclone cmount temp-1 temp-2 -vv --debug-fuse 2>&1 | grep Write: > cmount_no_direct_io_writes.txt
^C
darthshadow@server:~/test$ head -10 cmount_no_direct_io_writes.txt
2021/03/25 04:47:30 DEBUG : /1G.img: Write: ofst=0, fh=0x0
2021/03/25 04:47:30 DEBUG : /1G.img: >Write: n=4096
2021/03/25 04:47:30 DEBUG : /1G.img: Write: ofst=4096, fh=0x0
2021/03/25 04:47:30 DEBUG : /1G.img: >Write: n=4096
2021/03/25 04:47:30 DEBUG : /1G.img: Write: ofst=8192, fh=0x0
2021/03/25 04:47:30 DEBUG : /1G.img: >Write: n=4096
2021/03/25 04:47:30 DEBUG : /1G.img: Write: ofst=12288, fh=0x0
2021/03/25 04:47:30 DEBUG : /1G.img: >Write: n=4096
2021/03/25 04:47:30 DEBUG : /1G.img: Write: ofst=16384, fh=0x0
2021/03/25 04:47:30 DEBUG : /1G.img: >Write: n=4096

darthshadow@server:~/test$ (rm temp-2/1G.img || true) && dd if=1G.img of=temp-2/1G.img count=1024 bs=1048576
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 32.9387 s, 32.6 MB/s

With direct_io:

darthshadow@server:~/test$ rclone cmount -o direct_io temp-1 temp-2 -vv --debug-fuse 2>&1 | grep Write: > cmount_direct_io_writes.txt
^C
darthshadow@server:~/test$ head -10 cmount_direct_io_writes.txt
2021/03/25 04:46:59 DEBUG : /1G.img: Write: ofst=0, fh=0x0
2021/03/25 04:46:59 DEBUG : /1G.img: >Write: n=131072
2021/03/25 04:46:59 DEBUG : /1G.img: Write: ofst=131072, fh=0x0
2021/03/25 04:46:59 DEBUG : /1G.img: >Write: n=131072
2021/03/25 04:46:59 DEBUG : /1G.img: Write: ofst=262144, fh=0x0
2021/03/25 04:46:59 DEBUG : /1G.img: >Write: n=131072
2021/03/25 04:46:59 DEBUG : /1G.img: Write: ofst=393216, fh=0x0
2021/03/25 04:46:59 DEBUG : /1G.img: >Write: n=131072
2021/03/25 04:46:59 DEBUG : /1G.img: Write: ofst=524288, fh=0x0
2021/03/25 04:46:59 DEBUG : /1G.img: >Write: n=131072

darthshadow@server:~/test$ (rm temp-2/1G.img || true) && dd if=1G.img of=temp-2/1G.img count=1024 bs=1048576
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 7.29815 s, 147 MB/s
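Rather than eyeballing head -10, the captured logs can be summarised with a small Go sketch like the one below. It assumes the cmount log format shown above (">Write: n=<bytes>"); the function name writeSizeHistogram is hypothetical, and the bazil mount logs use a different format ("written=...") that this pattern does not match.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// writeSizeRe matches the completed-write lines from the cmount logs,
// e.g. "... DEBUG : /1G.img: >Write: n=4096".
var writeSizeRe = regexp.MustCompile(`>Write: n=(\d+)`)

// writeSizeHistogram counts how often each write size appears in the log.
// Lines that do not match the pattern are ignored.
func writeSizeHistogram(log string) map[int]int {
	hist := make(map[int]int)
	sc := bufio.NewScanner(strings.NewReader(log))
	for sc.Scan() {
		m := writeSizeRe.FindStringSubmatch(sc.Text())
		if m == nil {
			continue
		}
		n, err := strconv.Atoi(m[1])
		if err != nil {
			continue
		}
		hist[n]++
	}
	return hist
}

func main() {
	sample := `2021/03/25 04:47:30 DEBUG : /1G.img: Write: ofst=0, fh=0x0
2021/03/25 04:47:30 DEBUG : /1G.img: >Write: n=4096
2021/03/25 04:47:30 DEBUG : /1G.img: Write: ofst=4096, fh=0x0
2021/03/25 04:47:30 DEBUG : /1G.img: >Write: n=4096`
	fmt.Println(writeSizeHistogram(sample))
}
```

A histogram with a single 4096 bucket points at the small-write problem; a single 131072 bucket is what the direct_io run shows.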

ncw commented 3 years ago

I did a bit of experimenting with rclone mount vs rclone cmount -o direct_io

They appear to behave the same, so if you dd with a block size lower than 128k then that block size gets passed directly to fuse, which is what you'd expect from direct_io.

That is odd, because I don't think I'm setting direct_io in rclone mount except when we have an unknown size file.

https://github.com/rclone/rclone/blob/f6dbb98a1dbf818caa0b8d427db33b529d88b932/cmd/mount/file.go#L76-L79

Anyway I've always been led to believe that direct_io is slow not fast - the kernel page cache is there for a reason so I'm still puzzled.

darthShadow commented 3 years ago

There may be some differences because of the FUSE protocol version in use too. cgofuse, from my testing, uses version 7.19 of the fuse protocol, whereas bazil/fuse uses 7.12 and mergerfs uses 7.31 (which is the latest).

Edit: Corrected the versions.

darthShadow commented 3 years ago

Adding -o big_writes gives the same improvement as direct_io too. And that is enabled by default in bazil/fuse https://github.com/bazil/fuse/blob/master/fuse.go#L229

ncw commented 3 years ago

I think that is it @darthShadow -o big_writes.

Interestingly big_writes is not in any of the fuse man pages I've found.

I found this in the fuse changelog for v3.0.0 (and that was the only reference in the entire git repo!)

The -o big_writes mount option has been removed. It is now always active. File systems that want to limit the size of write requests should use the -o max_write=<N> option instead.

So that probably explains why I can't find big_writes in any documentation!

Setting -o max_write=1048576 does absolutely nothing without -o big_writes. With -o big_writes, cgofuse/fuse obeys max_write up to a maximum of 128k per write.

So I guess the solution for this is for rclone to set -o big_writes normally - what do you think @darthShadow ?

Longer term migrating cgofuse to libfuse v3 will also fix the problem as it enables big writes by default apparently.
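As a rough idea of what "rclone sets -o big_writes" could look like, here is a minimal Go sketch that builds the option list. The helper withBigWrites is hypothetical and not taken from rclone or cgofuse; it assumes options are passed as separate "-o", "name" pairs, and it limits the flag to Linux since that is the only platform tested so far in this thread.

```go
package main

import (
	"fmt"
	"runtime"
)

// withBigWrites is a hypothetical helper (not part of rclone or cgofuse).
// On Linux it appends "-o big_writes" unless the caller already passed it
// as a separate "-o", "big_writes" pair. Comma-joined option lists such as
// "-o big_writes,allow_other" are not inspected in this sketch.
func withBigWrites(goos string, opts []string) []string {
	if goos != "linux" {
		return opts
	}
	for i := 0; i+1 < len(opts); i++ {
		if opts[i] == "-o" && opts[i+1] == "big_writes" {
			return opts
		}
	}
	// Copy before appending so the caller's slice is never modified.
	return append(append([]string{}, opts...), "-o", "big_writes")
}

func main() {
	opts := withBigWrites(runtime.GOOS, []string{"-o", "max_write=131072"})
	fmt.Println(opts)
}
```

The resulting slice would then be handed to whatever mount call the program uses; that part is left out here.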

darthShadow commented 3 years ago

Ideally, cgofuse should be doing it instead since I can't see any downside to it and bazil/fuse also has it enabled by default.

There is one other large improvement that came recently in fuse with kernels > 4.20 that should probably also be enabled by default, which is max_pages. It defaults to 32 and, after 4.20, can be increased up to 256, changing the maximum max_write from 128k to 1M.
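The 128k and 1M figures follow from multiplying the page count by the page size. A minimal Go sketch of that arithmetic, assuming 4 KiB pages (true on typical x86-64 Linux, but an assumption here); maxWriteBytes is a made-up name:

```go
package main

import "fmt"

// maxWriteBytes returns the largest write payload that fits in maxPages
// pages of pageSize bytes each. Hypothetical helper for illustration.
func maxWriteBytes(maxPages, pageSize int) int {
	return maxPages * pageSize
}

func main() {
	const pageSize = 4096 // assumed 4 KiB pages
	for _, pages := range []int{32, 256} {
		fmt.Printf("max_pages=%3d -> max_write=%d bytes\n", pages, maxWriteBytes(pages, pageSize))
	}
}
```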

However, I am not sure how to go about implementing it in cgofuse due to not knowing C and not having the time to learn it right now.

If someone wants to take it up, these are the reference PR(s) in bazil/fuse (for Go) & mergerfs (for C):

https://github.com/bazil/fuse/pull/239 https://github.com/trapexit/mergerfs/pull/636

billziss-gh commented 3 years ago

@ncw

So I guess the solution for this is for rclone to set -o big_writes normally

@darthShadow

Ideally, cgofuse should be doing it instead since I can't see any downside to it and bazil/fuse also has it enabled by default.

Arguments for adding -o big_writes:

Better performance out of the box.
Some FUSE libraries/platforms do not limit writes. (I know for a fact that WinFsp does not.) Therefore a true cross-platform file system would have to handle "big" writes anyway.
Eliminate need for arcane system specific information on how to extract best write performance from FUSE.

Arguments against adding -o big_writes:

Potentially breaking backwards compatibility (e.g. for a Linux-only file system that relies on not having "big" writes).
It is not just -o big_writes. Determining optimal settings for every FUSE platform and baking them into cgofuse is a non-trivial problem.

In any case if we wanted to do this, we could scope it to those platforms that use libfuse:

Linux
macOS uses a long-ago fork of libfuse. I cannot remember if it supports -o big_writes.
FreeBSD also uses libfuse. I am unclear if it supports -o big_writes.

darthShadow commented 3 years ago

Based on the testing from here, it looks like osxfuse doesn't respect big_writes, so the flag doesn't provide any tangible benefit there. It instead has blocksize (which is missing on libfuse in Linux) to get the same benefits.

billziss-gh commented 3 years ago

@darthShadow thanks for doing all this research.

We can use your information to enable "big" writes across all supported systems (either implicitly or perhaps via a new FileSystemHost.SetCapBigWrites method (link)).

darthShadow commented 3 years ago

Do we want to set blocksize for macOS too? I am not sure if there are any downsides to it since there is almost 0 documentation about it apart from a mention in https://github.com/osxfuse/osxfuse/issues/507#issuecomment-404536504

Any thoughts on adding max-pages support too?

darthShadow commented 3 years ago

Got this from @trapexit for when we want to bump up the fuse version and need the fuse API changes: https://github.com/trapexit/mergerfs/blob/master/libfuse/include/fuse_kernel.h#L123

billziss-gh commented 3 years ago

Do we want to set blocksize for macOS too? I am not sure if there are any downsides to it since there is almost 0 documentation about it apart from a mention in osxfuse/osxfuse#507 (comment)

The OSXFUSE mount options document mentions iosize, but not blocksize. We may have to consult the osxfuse/macfuse source on this.

Any thoughts on adding max-pages support too?

I am not opposed to adding more options, but we must do so with a risk vs reward consideration. Every time we change something there is a risk that we will break an existing user. Our reward is of course that we fix a functionality or performance problem.

Unfortunately a lot of these options are not well described. Reading the source sometimes answers questions, but often real-world effects are missed. For example, does max_pages only affect size of writes as discussed here? Or can it have other consequences that may be deleterious in some scenarios?

Having said this I think there is a lot of value in what you are doing @darthShadow. If you or someone else compiles a comprehensive list of options/configurations/etc. that positively affect performance, I would certainly be interested to incorporate them into cgofuse in some form.

trapexit commented 3 years ago

max_pages impacts the max size of a fuse message. Practically speaking it impacts reads and writes the most (as getdent/readdir is currently limited to 4k messages by the kernel unfortunately.) Besides needing larger buffers to handle the size change there is no other effect.

ncw commented 3 years ago

@billziss-gh wrote:

Do we want to set blocksize for macOS too? I am not sure if there are any downsides to it since there is almost 0 documentation about it apart from a mention in osxfuse/osxfuse#507 (comment)

The OSXFUSE mount options document mentions iosize, but not blocksize. We may have to consult the osxfuse/macfuse source on this.

I don't think the sources for v4 are available :-( The sources for v3 are though.

Any thoughts on adding max-pages support too?

I am not opposed to adding more options, but we must do so with a risk vs reward consideration. Every time we change something there is a risk that we will break an existing user. Our reward is of course that we fix a functionality or performance problem.

It is easy enough for rclone to set -o big_writes and -o blocksize as options which are parsed by libfuse so we don't need to patch cgofuse to enable this support - I could do this very easily - what do you think @darthShadow? This fixes the backwards compatibility story for cgofuse. Maybe a note in the cgofuse docs saying these are recommended for performance.

However with max_pages I note from @trapexit's excellent doc that we need fuse version 7.28 for max_pages. I don't know whether we need to upgrade cgofuse somehow to enable that or whether -o max_pages=XXX will just work without changing -DFUSE_USE_VERSION=28 in the cgofuse sources.

I don't think max_pages is supported on my Linux machine (Ubuntu 20.04.1 LTS). This returns nothing, so I haven't been able to try it:

strings /usr/lib/x86_64-linux-gnu/libfuse*.so | grep max_pages

trapexit commented 3 years ago

@ncw max_pages is available in libfuse3 v3.6+ I believe. The default kernel in 20.04.1 should be fine. The feature was added to 4.20.
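The kernel half of that requirement can be checked from a release string such as the one uname -r prints. The Go sketch below is only an illustration: kernelAtLeast is a hypothetical helper, it looks at the first two version components only, and it says nothing about whether the installed libfuse exposes the option.

```go
package main

import "fmt"

// kernelAtLeast reports whether a kernel release string such as
// "5.4.0-42-generic" is at least major.minor. Only the first two
// numeric components are parsed; anything after them is ignored.
func kernelAtLeast(release string, major, minor int) (bool, error) {
	var gotMajor, gotMinor int
	if _, err := fmt.Sscanf(release, "%d.%d", &gotMajor, &gotMinor); err != nil {
		return false, fmt.Errorf("cannot parse kernel release %q: %w", release, err)
	}
	return gotMajor > major || (gotMajor == major && gotMinor >= minor), nil
}

func main() {
	for _, rel := range []string{"4.19.0", "4.20.0", "5.4.0-42-generic"} {
		ok, err := kernelAtLeast(rel, 4, 20)
		fmt.Println(rel, ok, err)
	}
}
```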

darthShadow commented 3 years ago

Both big_writes & max_pages seem like low-effort high-impact changes which should probably be enabled by default in the library.

If we do decide not to enable either of the above flags, then we should at least consider bumping up the supported fuse version to 7.31 so we can at least make use of those on the client (rclone).

blocksize on the other hand makes a good candidate for enabling it on the rclone side only (at least for now) in case there are any non-obvious downsides to it.

Maybe a note in the cgofuse docs saying these are recommended for performance.

A performance section of some obvious flags to try sounds like a good idea.

I don't think max_pages is supported on my Linux machine Ubuntu 20.04.1 LTS

It should be supported as long as you have a kernel greater than 4.20, which is the case by default in 20.04.

this returns nothing strings /usr/lib/x86_64-linux-gnu/libfuse*.so | grep max_pages

I am not sure if that is the correct test for it. I don't even have any libfuse*.so files (probably because I don't have the libfuse-dev package installed), and I was still able to try it out with mergerfs and see it reflected correctly in the response returned by the kernel and dumped by mergerfs to the stdout like this:

For the default 32 pages (pre kernel 4.20 & fuse 7.28):

mergerfs -d -o xattr=nosys -o cache.attr=0 -o minfreespace=0 -o fuse_msg_size=32 -o async_read=true -o statfs_ignore=ro temp-1 temp-2

FUSE library version: 2.9.7-mergerfs_2.30.0
unique: 2, opcode: INIT (26), nodeid: 0, insize: 56, pid: 0
INIT: 7.31
flags=0x03fffffb
max_readahead=0x00020000
   INIT: 7.31
   flags=0x00448079
   max_readahead=0x00020000
   max_write=0x00100000
   max_background=0
   congestion_threshold=0
   max_pages=32
   unique: 2, success, outsize: 80

For the increased value of 256 pages (post kernel 4.20 & fuse 7.28):

mergerfs -d -o xattr=nosys -o cache.attr=0 -o minfreespace=0 -o fuse_msg_size=256 -o async_read=true -o statfs_ignore=ro temp-1 temp-2

FUSE library version: 2.9.7-mergerfs_2.30.0
unique: 2, opcode: INIT (26), nodeid: 0, insize: 56, pid: 0
INIT: 7.31
flags=0x03fffffb
max_readahead=0x00020000
   INIT: 7.31
   flags=0x00448079
   max_readahead=0x00020000
   max_write=0x00100000
   max_background=0
   congestion_threshold=0
   max_pages=256
   unique: 2, success, outsize: 80
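The hex values in these dumps are easier to compare once decoded. The Go sketch below pulls name=value pairs out of a dump shaped like the ones above and converts them to integers. parseInitFields is a hypothetical helper; because the request and reply sections repeat some names (flags, max_readahead), the later value wins in this sketch.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseInitFields extracts "name=value" lines from an INIT debug dump.
// Values may be decimal ("256") or hex with a 0x prefix ("0x00100000").
// Lines without "=" or with non-numeric values are skipped. When a name
// appears more than once, the last occurrence is kept.
func parseInitFields(dump string) map[string]int64 {
	fields := make(map[string]int64)
	for _, line := range strings.Split(dump, "\n") {
		parts := strings.SplitN(strings.TrimSpace(line), "=", 2)
		if len(parts) != 2 {
			continue
		}
		// Base 0 lets ParseInt pick hex or decimal from the prefix.
		v, err := strconv.ParseInt(parts[1], 0, 64)
		if err != nil {
			continue
		}
		fields[parts[0]] = v
	}
	return fields
}

func main() {
	dump := `   INIT: 7.31
   flags=0x00448079
   max_readahead=0x00020000
   max_write=0x00100000
   max_pages=256
   unique: 2, success, outsize: 80`
	f := parseInitFields(dump)
	fmt.Println("max_write:", f["max_write"], "max_pages:", f["max_pages"])
}
```

Decoded, max_write=0x00100000 is 1048576 bytes (1 MiB) and max_readahead=0x00020000 is 131072 bytes (128 KiB).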

trapexit commented 3 years ago

It's not useful to use mergerfs as an example here wrt anything libfuse as it uses a custom fork of libfuse2 which I maintain for the project. That said it would show the kernel supports the feature.

ncw commented 3 years ago

Using --debug-fuse with rclone should show the same info.

However with kernel 5.4 on ubuntu 20.04 fuse complains about both the flags

$ rclone cmount /tmp/swiftsource/ /mnt/tmp/ -vv --debug-fuse -o fuse_msg_size=32
fuse: unknown option `fuse_msg_size=32'

$ rclone cmount /tmp/swiftsource/ /mnt/tmp/ -vv --debug-fuse -o max_pages=64
fuse: unknown option `max_pages=64'

This is with libfuse 2.9.9-3

trapexit commented 3 years ago

libfuse2 is deprecated and does not support the feature. It's only in libfuse3. And "fuse_msg_size" is a mergerfs thing. Not a libfuse thing.

https://github.com/libfuse/libfuse/commit/027d0d17c8a4605109f09d9c988e255b64a2c19a

darthShadow commented 1 year ago

Different fuse library but still relevant: https://github.com/distr1/distri/issues/59

ncw commented 1 year ago

I just retested rclone...

rclone cmount -vv /tmp/data /mnt/tmp/ 2>&1 | grep "Write: "
...
2023/02/04 15:47:34 DEBUG : /1G: >Write: n=4096

$ dd if=/tmp/1G of=/mnt/tmp/1G bs=128k
8192+0 records in
8192+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 6.62388 s, 162 MB/s

But

rclone cmount -vv /tmp/data /mnt/tmp/  -o big_writes 2>&1 | grep "Write: "
...
2023/02/04 15:49:04 DEBUG : /1G: >Write: n=131072

$ dd if=/tmp/1G of=/mnt/tmp/1G bs=128k
8192+0 records in
8192+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 3.37313 s, 318 MB/s

So this issue is still relevant.
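For comparing runs, the throughput dd reports can be recomputed from the byte count and elapsed time. A minimal Go sketch, using decimal megabytes as dd does; mbPerSec is a made-up helper name:

```go
package main

import "fmt"

// mbPerSec converts a byte count and elapsed seconds into decimal
// megabytes per second (1 MB = 1e6 bytes), the unit dd prints.
func mbPerSec(bytes int64, seconds float64) float64 {
	return float64(bytes) / seconds / 1e6
}

func main() {
	const size = 1073741824 // bytes copied in both dd runs above
	without := mbPerSec(size, 6.62388)
	with := mbPerSec(size, 3.37313)
	fmt.Printf("without big_writes: %.0f MB/s\n", without)
	fmt.Printf("with big_writes:    %.0f MB/s\n", with)
	fmt.Printf("speedup:            %.2fx\n", with/without)
}
```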

That's for Linux - how does this perform on macOS @darthShadow with the correct -o flag?

We should put these workarounds in somewhere either in cgofuse or in rclone!

billziss-gh commented 1 year ago

We should put these workarounds in somewhere either in cgofuse or in rclone!

I just returned from a trip and catching up. What are the workarounds that need to be added into cgofuse and/or rclone? Was it just -o big_writes which could be added from the command line or were there other ones?

ncw commented 1 year ago

I just returned from a trip and catching up.

I hope it was a good one :-)

What are the workarounds that need to be added into cgofuse and/or rclone? Was it just -o big_writes which could be added from the command line or were there other ones?

I think -o big_writes for linux seems like a no-brainer. It's the default for libfuse3 too.

I can put it in rclone easily enough if you don't want to add it to cgofuse.

billziss-gh commented 1 year ago

I just returned from a trip and catching up.

I hope it was a good one :-)

Thank you. It was quite nice actually!

I think -o big_writes for linux seems like a no-brainer. It's the default for libfuse3 too.

I can put it in rclone easily enough if you don't want to add it to cgofuse.

I think it may make more sense to add such an option to rclone as it might break compatibility for other file systems.

ncw commented 1 year ago

Ok sounds good to me :-)

Are you thinking about supporting fuse3 on Linux? That seems to be standard everywhere now.

billziss-gh commented 1 year ago

It looks like it is something that has to be done sooner or later, although I am rather busy at the moment.

darthShadow commented 1 year ago

how does this perform on macOS

It's the same 16k with or without the flag, so no difference. Setting blocksize to 1048576 does increase it to 128k-sized writes but causes file corruption.

winfsp / cgofuse

Slow performance writing compared to bazil.org/fuse #55