Open ncw opened 3 years ago
Interesting. As cgofuse is really a thin shim over libfuse, this must be a misconfiguration of libfuse.
Configuration of the FUSE "connection" for the cgo version of cgofuse happens in hostAsgnCconninfo.
Since you are trying this on Linux, my recommendation would be to test different values for max_write (and other fields) in the struct fuse_conn_info that you get in hostAsgnCconninfo. For example, you could try setting max_write as below and retry your benchmarks (untested):
conn->max_write = 128 * 1024;
There is also the possibility that libfuse2 (that cgofuse uses) does not have optimizations in libfuse3, in which case we may have to finally move cgofuse to support libfuse3. This is a much larger undertaking, because we would not want to break compatibility with systems that only have the FUSE2 API.
@ncw Adding -o direct_io to the rclone mount solves the issue for me on Linux.
direct_io sounds like the wrong thing to add with my reading of the docs.
https://libfuse.github.io/doxygen/structfuse__config.html#ae335bab50dfddef49b0ed81671066fa8
If you set the block size of the dd to 4k then it will change the write block size I think. I'll try this tomorrow.
However it shows that the block size is the problem.
It actually seems correct based on my usage of it in mergerfs too across multiple servers to get the best performance. The moment we introduce kernel caching (by not adding direct_io), the performance drops as it essentially double-caches the file with no control over how fast the kernel sends it to the application. Admittedly, I don't know much about the internals of fuse, other than what I have learnt from my testing, so there may be some other explanation too.
Without direct_io:
darthshadow@server:~/test$ rclone cmount temp-1 temp-2 -vv --debug-fuse 2>&1 | grep Write: > cmount_no_direct_io_writes.txt
^C
darthshadow@server:~/test$ head -10 cmount_no_direct_io_writes.txt
2021/03/25 04:47:30 DEBUG : /1G.img: Write: ofst=0, fh=0x0
2021/03/25 04:47:30 DEBUG : /1G.img: >Write: n=4096
2021/03/25 04:47:30 DEBUG : /1G.img: Write: ofst=4096, fh=0x0
2021/03/25 04:47:30 DEBUG : /1G.img: >Write: n=4096
2021/03/25 04:47:30 DEBUG : /1G.img: Write: ofst=8192, fh=0x0
2021/03/25 04:47:30 DEBUG : /1G.img: >Write: n=4096
2021/03/25 04:47:30 DEBUG : /1G.img: Write: ofst=12288, fh=0x0
2021/03/25 04:47:30 DEBUG : /1G.img: >Write: n=4096
2021/03/25 04:47:30 DEBUG : /1G.img: Write: ofst=16384, fh=0x0
2021/03/25 04:47:30 DEBUG : /1G.img: >Write: n=4096
darthshadow@server:~/test$ (rm temp-2/1G.img || true) && dd if=1G.img of=temp-2/1G.img count=1024 bs=1048576
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 32.9387 s, 32.6 MB/s
With direct_io:
darthshadow@server:~/test$ rclone cmount -o direct_io temp-1 temp-2 -vv --debug-fuse 2>&1 | grep Write: > cmount_direct_io_writes.txt
^C
darthshadow@server:~/test$ head -10 cmount_direct_io_writes.txt
2021/03/25 04:46:59 DEBUG : /1G.img: Write: ofst=0, fh=0x0
2021/03/25 04:46:59 DEBUG : /1G.img: >Write: n=131072
2021/03/25 04:46:59 DEBUG : /1G.img: Write: ofst=131072, fh=0x0
2021/03/25 04:46:59 DEBUG : /1G.img: >Write: n=131072
2021/03/25 04:46:59 DEBUG : /1G.img: Write: ofst=262144, fh=0x0
2021/03/25 04:46:59 DEBUG : /1G.img: >Write: n=131072
2021/03/25 04:46:59 DEBUG : /1G.img: Write: ofst=393216, fh=0x0
2021/03/25 04:46:59 DEBUG : /1G.img: >Write: n=131072
2021/03/25 04:46:59 DEBUG : /1G.img: Write: ofst=524288, fh=0x0
2021/03/25 04:46:59 DEBUG : /1G.img: >Write: n=131072
darthshadow@server:~/test$ (rm temp-2/1G.img || true) && dd if=1G.img of=temp-2/1G.img count=1024 bs=1048576
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 7.29815 s, 147 MB/s
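Comparing the two captures by eye with head -10 only shows the first few writes. A minimal sketch for summarizing a whole capture follows, assuming the log keeps the ">Write: n=<bytes>" completion format shown above; write_size_histogram is a hypothetical helper name, not part of rclone or cgofuse.

```python
import re
from collections import Counter

# Matches the write completion lines seen in the captures above, e.g.
# "2021/03/25 04:47:30 DEBUG : /1G.img: >Write: n=4096"
WRITE_DONE = re.compile(r">Write: n=(\d+)")


def write_size_histogram(lines):
    """Count how many completed writes were logged for each size in bytes."""
    sizes = Counter()
    for line in lines:
        match = WRITE_DONE.search(line)
        if match:
            sizes[int(match.group(1))] += 1
    return sizes


# A few lines copied from the two captures above, used as sample input.
sample = [
    "2021/03/25 04:47:30 DEBUG : /1G.img: Write: ofst=0, fh=0x0",
    "2021/03/25 04:47:30 DEBUG : /1G.img: >Write: n=4096",
    "2021/03/25 04:46:59 DEBUG : /1G.img: Write: ofst=0, fh=0x0",
    "2021/03/25 04:46:59 DEBUG : /1G.img: >Write: n=131072",
    "2021/03/25 04:46:59 DEBUG : /1G.img: >Write: n=131072",
]

histogram = write_size_histogram(sample)
for size, count in sorted(histogram.items()):
    print(f"{size:>8} bytes: {count} writes")
```

To run it over a real capture, pass an open file instead of the sample list, for example write_size_histogram(open("cmount_no_direct_io_writes.txt")).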
I did a bit of experimenting with rclone mount vs rclone cmount -o direct_io.
They appear to behave the same, so if you dd with a block size lower than 128k then that block size gets passed directly to fuse, which is what you'd expect from direct_io.
This is odd because I don't think I'm setting direct_io in rclone mount except when we have a file of unknown size.
Anyway I've always been led to believe that direct_io is slow, not fast - the kernel page cache is there for a reason, so I'm still puzzled.
There may be some changes because of the FUSE protocol version in use too. From my testing, cgofuse uses version 7.19 of the fuse protocol, whereas bazil/fuse uses 7.12 and mergerfs uses 7.31 (which is the latest).
Edit: Corrected the versions.
Adding -o big_writes gives the same improvement as direct_io too. And that is enabled by default in bazil/fuse: https://github.com/bazil/fuse/blob/master/fuse.go#L229
I think that is it @darthShadow: -o big_writes.
Interestingly big_writes is not in any of the fuse man pages I've found.
I found this in the fuse changelog for v3.0.0 (and that was the only reference in the entire git repo!)
The -o big_writes mount option has been removed. It is now always active. File systems that want to limit the size of write requests should use the -o max_write=<N> option instead.
So that probably explains why I can't find big_writes in any documentation!
Setting -o max_write=1048576 does absolutely nothing without -o big_writes. cgofuse/fuse obeys max_write up to a maximum of 128k per write.
So I guess the solution for this is for rclone to set -o big_writes normally - what do you think @darthShadow?
Longer term migrating cgofuse to libfuse v3 will also fix the problem as it enables big writes by default apparently.
Ideally, cgofuse should be doing it instead since I can't see any downside to it and bazil/fuse also has it enabled by default.
There is one other large improvement that came recently in fuse with kernel 4.20 and later that should probably also be enabled by default, which is max_pages. It defaults to 32 and, from 4.20 onwards, can be increased up to 256, raising the largest possible max_write from 128k to 1M.
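The 128k and 1M figures follow from multiplying the page count by the page size. A quick check, assuming 4 KiB pages (typical on x86-64 Linux, but not stated in this thread); max_request_bytes is a hypothetical name used only for illustration:

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages; other architectures may differ


def max_request_bytes(max_pages, page_size=PAGE_SIZE):
    """Largest payload a single FUSE request can carry with this page limit."""
    return max_pages * page_size


default_limit = max_request_bytes(32)   # the default of 32 pages
raised_limit = max_request_bytes(256)   # the raised limit of 256 pages

print(f"32 pages  -> {default_limit} bytes ({default_limit // 1024}k)")
print(f"256 pages -> {raised_limit} bytes ({raised_limit // 1024}k)")
```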
However, I am not sure how to go about implementing it in cgofuse due to not knowing C and not having the time to learn it right now.
If someone wants to take it up, these are the reference PR(s) in bazil/fuse (for Go) & mergerfs (for C):
https://github.com/bazil/fuse/pull/239 https://github.com/trapexit/mergerfs/pull/636
@ncw
So I guess the solution for this is for rclone to set -o big_writes normally
@darthShadow
Ideally, cgofuse should be doing it instead since I can't see any downside to it and bazil/fuse also has it enabled by default.
Arguments for adding -o big_writes:
Arguments against adding -o big_writes:
-o big_writes. Determining optimal settings for every FUSE platform and baking them into cgofuse is a non-trivial problem.
In any case if we wanted to do this, we could scope it to those platforms that use libfuse:
-o big_writes.
-o big_writes.
Based on the testing from here, it looks like osxfuse doesn't respect big_writes, and the option doesn't provide any tangible benefit there. It instead has blocksize (which is missing on libfuse in Linux) to get the same benefits.
@darthShadow thanks for doing all this research.
We can use your information to enable "big" writes across all supported systems (either implicitly or perhaps via a new FileSystemHost.SetCapBigWrites method (link)).
Do we want to set blocksize for macOS too? I am not sure if there are any downsides to it since there is almost 0 documentation about it apart from a mention in https://github.com/osxfuse/osxfuse/issues/507#issuecomment-404536504
Any thoughts on adding max-pages support too?
Got this from @trapexit for when we want to bump up the fuse version and need the fuse API changes: https://github.com/trapexit/mergerfs/blob/master/libfuse/include/fuse_kernel.h#L123
Do we want to set blocksize for macOS too? I am not sure if there are any downsides to it since there is almost 0 documentation about it apart from a mention in osxfuse/osxfuse#507 (comment)
The OSXFUSE mount options document mentions iosize, but not blocksize. We may have to consult the osxfuse/macfuse source on this.
Any thoughts on adding max-pages support too?
I am not opposed to adding more options, but we must do so with a risk vs reward consideration. Every time we change something there is a risk that we will break an existing user. Our reward is of course that we fix a functionality or performance problem.
Unfortunately a lot of these options are not well described. Reading the source sometimes answers questions, but often real-world effects are missed. For example, does max_pages only affect the size of writes as discussed here? Or can it have other consequences that may be deleterious in some scenarios?
Having said this I think there is a lot of value in what you are doing @darthShadow. If you or someone else compiles a comprehensive list of options/configurations/etc. that positively affect performance, I would certainly be interested to incorporate them into cgofuse in some form.
max_pages impacts the max size of a fuse message. Practically speaking it impacts reads and writes the most (as getdent/readdir is currently limited to 4k messages by the kernel unfortunately.) Besides needing larger buffers to handle the size change there is no other effect.
@billziss-gh wrote:
Do we want to set blocksize for macOS too? I am not sure if there are any downsides to it since there is almost 0 documentation about it apart from a mention in osxfuse/osxfuse#507 (comment)
The OSXFUSE mount options document mentions iosize, but not blocksize. We may have to consult the osxfuse/macfuse source on this.
I don't think the sources for v4 are available :-( The sources for v3 are though.
Any thoughts on adding max-pages support too?
I am not opposed to add more options, but we must do so with a risk vs reward consideration. Every time we change something there is a risk that we will break an existing user. Our reward is of course that we fix a functionality or performance problem.
It is easy enough for rclone to set -o big_writes and -o blocksize as options which are parsed by libfuse, so we don't need to patch cgofuse to enable this support - I could do this very easily - what do you think @darthShadow? This fixes the backwards compatibility story for cgofuse. Maybe a note in the cgofuse docs saying these are recommended for performance.
However with max_pages, I note from @trapexit's excellent doc that we need fuse version 7.28 for max_pages. I don't know whether we need to upgrade cgofuse somehow to enable that, or whether -o max_pages=XXX will just work without changing -DFUSE_USE_VERSION=28 in the cgofuse sources.
I don't think max_pages is supported on my Linux machine Ubuntu 20.04.1 LTS
this returns nothing strings /usr/lib/x86_64-linux-gnu/libfuse*.so | grep max_pages
so I haven't been able to try it!
@ncw max_pages is available in libfuse3 v3.6+ I believe. The default kernel in 20.04.1 should be fine. The feature was added to 4.20.
Both big_writes & max_pages seem like low-effort high-impact changes which should probably be enabled by default in the library.
If we do decide not to enable either of the above flags, then we should at least consider bumping up the supported fuse version to 7.31 so we can at least make use of those on the client (rclone).
blocksize, on the other hand, makes a good candidate for enabling on the rclone side only (at least for now) in case there are any non-obvious downsides to it.
Maybe a note in the cgofuse docs saying these are recommended for performance.
A performance section of some obvious flags to try sounds like a good idea.
I don't think max_pages is supported on my Linux machine Ubuntu 20.04.1 LTS
It should be supported as long as you have a kernel greater than 4.20, which is the case by default in 20.04.
this returns nothing strings /usr/lib/x86_64-linux-gnu/libfuse*.so | grep max_pages
I am not sure if that is the correct test for it. I don't even have any libfuse*.so files (probably because I don't have the libfuse-dev package installed), and I was still able to try it out with mergerfs and see it reflected correctly in the response returned by the kernel and dumped by mergerfs to stdout like this:
For the default 32 pages (pre kernel 4.20 & fuse 7.28):
mergerfs -d -o xattr=nosys -o cache.attr=0 -o minfreespace=0 -o fuse_msg_size=32 -o async_read=true -o statfs_ignore=ro temp-1 temp-2
FUSE library version: 2.9.7-mergerfs_2.30.0
unique: 2, opcode: INIT (26), nodeid: 0, insize: 56, pid: 0
INIT: 7.31
flags=0x03fffffb
max_readahead=0x00020000
INIT: 7.31
flags=0x00448079
max_readahead=0x00020000
max_write=0x00100000
max_background=0
congestion_threshold=0
max_pages=32
unique: 2, success, outsize: 80
For the increased value of 256 pages (post kernel 4.20 & fuse 7.28):
mergerfs -d -o xattr=nosys -o cache.attr=0 -o minfreespace=0 -o fuse_msg_size=256 -o async_read=true -o statfs_ignore=ro temp-1 temp-2
FUSE library version: 2.9.7-mergerfs_2.30.0
unique: 2, opcode: INIT (26), nodeid: 0, insize: 56, pid: 0
INIT: 7.31
flags=0x03fffffb
max_readahead=0x00020000
INIT: 7.31
flags=0x00448079
max_readahead=0x00020000
max_write=0x00100000
max_background=0
congestion_threshold=0
max_pages=256
unique: 2, success, outsize: 80
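Both dumps above report max_write=0x00100000, so only max_pages differs between them. A rough sketch for decoding such an INIT reply follows. It assumes 4 KiB pages and assumes that each request is capped at the smaller of max_write and max_pages worth of pages; neither assumption is stated in the dumps themselves. parse_init_fields and effective_write_limit are hypothetical helper names.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def parse_init_fields(text):
    """Turn "key=value" lines from a debug dump into a dict of integers."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep:
            # Base 0 accepts both hex ("0x00100000") and decimal ("32").
            fields[key] = int(value, 0)
    return fields


def effective_write_limit(fields, page_size=PAGE_SIZE):
    """Assumed cap on one write: min(max_write, max_pages * page size)."""
    return min(fields["max_write"], fields["max_pages"] * page_size)


# The second INIT block (the reply) from the first dump above.
reply_32 = """\
INIT: 7.31
flags=0x00448079
max_readahead=0x00020000
max_write=0x00100000
max_background=0
congestion_threshold=0
max_pages=32
"""

fields_32 = parse_init_fields(reply_32)
fields_256 = dict(fields_32, max_pages=256)  # same reply with max_pages=256

print(effective_write_limit(fields_32))
print(effective_write_limit(fields_256))
```

Under those assumptions the 32-page reply would still limit writes to 128k despite the 1M max_write, which matches the max_pages discussion earlier in the thread.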
It's not useful to use mergerfs as an example here wrt anything libfuse as it uses a custom fork of libfuse2 which I maintain for the project. That said it would show the kernel supports the feature.
Using --debug-fuse with rclone should show the same info.
However with kernel 5.4 on ubuntu 20.04 fuse complains about both the flags
$ rclone cmount /tmp/swiftsource/ /mnt/tmp/ -vv --debug-fuse -o fuse_msg_size=32
fuse: unknown option `fuse_msg_size=32'
$ rclone cmount /tmp/swiftsource/ /mnt/tmp/ -vv --debug-fuse -o max_pages=64
fuse: unknown option `max_pages=64'
This is with libfuse 2.9.9-3
libfuse2 is deprecated and does not support the feature. It's only in libfuse3. And "fuse_msg_size" is a mergerfs thing. Not a libfuse thing.
https://github.com/libfuse/libfuse/commit/027d0d17c8a4605109f09d9c988e255b64a2c19a
Different fuse library but still relevant: https://github.com/distr1/distri/issues/59
I just retested rclone...
rclone cmount -vv /tmp/data /mnt/tmp/ 2>&1 | grep "Write: "
...
2023/02/04 15:47:34 DEBUG : /1G: >Write: n=4096
$ dd if=/tmp/1G of=/mnt/tmp/1G bs=128k
8192+0 records in
8192+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 6.62388 s, 162 MB/s
But
rclone cmount -vv /tmp/data /mnt/tmp/ -o big_writes 2>&1 | grep "Write: "
...
2023/02/04 15:49:04 DEBUG : /1G: >Write: n=131072
$ dd if=/tmp/1G of=/mnt/tmp/1G bs=128k
8192+0 records in
8192+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 3.37313 s, 318 MB/s
So this issue is still relevant.
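To put the two runs above in perspective, here is a back-of-the-envelope calculation of how many FUSE write requests each one needs for the same 1 GiB file. It assumes every write is full-size, which the truncated logs suggest but do not prove; requests_needed is a hypothetical helper name.

```python
TOTAL_BYTES = 1073741824  # file size reported by dd above


def requests_needed(total_bytes, write_size):
    """Number of write requests if every request carries write_size bytes."""
    return -(-total_bytes // write_size)  # ceiling division


requests_4k = requests_needed(TOTAL_BYTES, 4096)      # without -o big_writes
requests_128k = requests_needed(TOTAL_BYTES, 131072)  # with -o big_writes

print(f"4k writes:   {requests_4k} requests")
print(f"128k writes: {requests_128k} requests")
print(f"ratio:       {requests_4k // requests_128k}x fewer requests")

# Throughput figures reported by dd in the two runs above, in MB/s.
speedup = 318 / 162
print(f"measured speedup: {speedup:.2f}x")
```

The request count drops by a factor of 32, while the measured throughput roughly doubles, so per-request overhead is clearly not the only cost in this test.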
That's for Linux - how does this perform on macOS @darthShadow with the correct -o flag?
We should put these workarounds in somewhere either in cgofuse or in rclone!
We should put these workarounds in somewhere either in cgofuse or in rclone!
I just returned from a trip and catching up. What are the workarounds that need to be added into cgofuse and/or rclone. Was it just -o big_writes which could be added from the command line or were there other ones?
I just returned from a trip and catching up.
I hope it was a good one :-)
What are the workarounds that need to be added into cgofuse and/or rclone. Was it just -o big_writes which could be added from the command line or were there other ones?
I think -o big_writes for linux seems like a no-brainer. It's the default for libfuse3 too.
I can put it in rclone easily enough if you don't want to add it to cgofuse.
I just returned from a trip and catching up.
I hope it was a good one :-)
Thank you. It was quite nice actually!
I think -o big_writes for linux seems like a no-brainer. It's the default for libfuse3 too.
I can put it in rclone easily enough if you don't want to add it to cgofuse.
I think it may make more sense to add such an option to rclone as it might break compatibility for other file systems.
Ok sounds good to me :-)
Are you thinking about supporting fuse3 on Linux? That seems to be standard everywhere now.
It looks like it is something that has to be done sooner or later, although I am rather busy at the moment.
how does this perform on macOS
It's the same 16k with or without the flag, so no difference. Setting blocksize to 1048576 does increase it to 128k-sized writes but causes file corruption.
Hi!
I'm trying to track down a performance issue with cgofuse vs bazil.org/fuse in rclone.
Here is cgofuse mounting a local disk on Linux.
And here is the mount command. Note the 4k writes
And here is bazil.org/fuse
And its mount command - note the 128k blocks
I'd say this difference is entirely down to the different sized write blocks, but I haven't been able to change it. I've tried loads of fuse parameters (max_write, auto_cache) and I haven't been able to budge cgofuse from its 4k blocks for write.
Fuse mysteriously says this in the docs for max_write
This was originally noted with macOS on the rclone forum but it replicated with linux for me very easily.
Any help much appreciated - running short of hair to pull out ;-)
(Testing done with rclone master (built with go build -tags cmount to include cgofuse support on Linux, and use --debug-fuse to see the fuse debug).)