chocolate42 opened 1 year ago
I have made no attempt at finding relevant code (since I'm incompetent at it), but the behaviour seems to be introduced with 1.3.1, which is consistent with the changelog entry that reads "I/O buffering improvements on Windows to reduce disk fragmentation when writing files." Speed might be just a side effect ... ?
An observation that might be relevant: After "first write" (i.e. first size update) Windows reports 10 485 760 bytes and Size on disk: 10 485 760 bytes too. But, after second update when Windows reports file size of 20 971 520, "Size on disk" is bigger. It is not constant across settings. Also as the number of writes (that is, the multiple of 10 MiB) increases, the difference between "Size on disk" and "size" also oscillates. But that difference seems to be pinned to zero at first write.
Also tried setting a 10MiB+8 buffer in the decoder reader (10mb8r). Same re-encode test, average of 10 runs:
        HDD    SSD
10mb8   9.561  7.244
10mb8r  9.552  7.083
There might be a benefit to a large buffer when reading, but it's close to the margin of error. Reading will also be harder to measure, as OS caching likely obscures things. Not added to the repo.
But the decoder writes too, so I tried hacking a 10MiB+8 buffer into ./flac's decode.c writer. Same file as before, decoded to wav. Average of 10 runs:
          HDD     SSD
default   11.788  6.29
10mb8dec  10.244  5.73
Less pronounced but still a definite benefit on my system (writing larger chunks may better hide the drawback of small buffers, which could explain why the benefit on decode is smaller). If Windows also sees a performance benefit from a large buffer, this should apply. If someone wants to test potential performance improvements on Windows they can try compiling the current state of my repo and the current state of the xiph repo and comparing decode when a file is generated.
> I have made no attempt at finding relevant code (since I'm incompetent at it), but the behaviour seems to be introduced with 1.3.1, which is consistent with the changelog entry that reads "I/O buffering improvements on Windows to reduce disk fragmentation when writing files." Speed might be just a side effect ... ?
That tracks.
I don't know what to make of this. FLAC calls fflush after every fwrite, so large buffers should not make a difference? Even with setvbuf, fflush is honoured as far as I know. Also, I can't explain the difference between encoding and decoding really. They both seem to write the same way (once for every frame).
Feel free to explore this further; I probably won't spend much time on this until I clear some bugs found by fuzzing that seem rather hard to debug.
> I don't know what to make of this. FLAC calls fflush after every fwrite, so large buffers should not make a difference? Even with setvbuf, fflush is honoured as far as I know.
I don't think fflush is called every frame; fflush is present only when FLAC__VALGRIND_TESTING is defined on encode and decode, presumably only in test and/or debug builds.
> Also, I can't explain the difference between encoding and decoding really. They both seem to write the same way (once for every frame).
The only difference I can see is that on decode the data is more consistent and there's more to write. More data should mean more impact if anything but there seems to be less impact.
> Feel free to explore this further, I probably won't spend much time on this until I clear some bugs found by fuzzing that seem rather hard to debug.
I'll explore it to a conclusion as much as I can on Linux. The only other hardware I have has a modern fast SSD which probably won't be of much help, and VMs seem like a bad idea here. Maybe there's a raspberry pi or slow USB stick somewhere that might be useful to make I/O issues more obvious.
> I don't think fflush is called every frame, fflush is present only when FLAC__VALGRIND_TESTING is defined on encode and decode, presumably only in test and/or debug builds.
Yes indeed, I didn't see that right.
Tested decode on 2193 tracks, mostly 16 bit, against the default build and the vbuf10m8dec (renamed 10m8) build. ~10% of tracks were decoded quicker by the default build but the sum clearly trends towards the 10m8 build; the laptop was being lightly used during the test, which probably didn't help with variance. One build fully ran the tracks followed by the other build fully running the tracks, to eliminate caching as a potential factor.
         HDD wall time sum
default  5918.10
10m8     4508.47
Made a selection of builds with varying buffer sizes and tested decode with the 150MB 24 bit track, 10 runs averaged:
k=KiB, m=MiB, xm8=xMiB+8
         HDD     SSD
default  11.938  6.202
8k       11.614  5.809
8k8      12.117  5.786
64k      10.972  5.723
64k8     11.225  5.738
8m       10.004  5.712
8m8      10.156  5.666
10m      10.064  5.689
10m8     10.161  5.699
12m      10.062  5.688
12m8     10.144  5.669
16m      10.159  5.719
16m8     10.210  5.711
That seems to confirm a benefit, but not the reason behind it. Most +8 buffer sizes now show as slightly worse, though probably too close to the margin of error to make a judgement; that being the case, the +8 is probably irrelevant.
Here's some perf stats from a few decode runs. I only have a rudimentary grasp of tools like perf and gdb so this'll have to do for now. The default build has more branch misses for some reason. Fewer page faults on the default build makes sense: less memory used, so fewer first-time accesses causing faults.
10m build
u20@u20:~/Documents/mountain/runtime/flac$ perf stat ./vbuf10m/src/flac/flac -d /media/u20/Data/1159.flac -o /media/u20/Data/1159.wav -f
flac git-f061c6b0 20230411
Copyright (C) 2000-2009 Josh Coalson, 2011-2022 Xiph.Org Foundation
flac comes with ABSOLUTELY NO WARRANTY. This is free software, and you are
welcome to redistribute it under certain conditions. Type `flac' for details.
1159.flac: done
Performance counter stats for './vbuf10m/src/flac/flac -d /media/u20/Data/1159.flac -o /media/u20/Data/1159.wav -f':
5,863.31 msec task-clock # 0.561 CPUs utilized
56,581 context-switches # 9.650 K/sec
28 cpu-migrations # 4.775 /sec
2,729 page-faults # 465.436 /sec
6,285,358,258 cycles # 1.072 GHz
15,821,032,908 instructions # 2.52 insn per cycle
1,368,116,924 branches # 233.335 M/sec
7,874,357 branch-misses # 0.58% of all branches
10.458442088 seconds time elapsed
5.180040000 seconds user
0.968928000 seconds sys
u20@u20:~/Documents/mountain/runtime/flac$ perf stat ./vbuf10m/src/flac/flac -d /media/u20/Data/1159.flac -o /media/u20/Data/1159.wav -f
flac git-f061c6b0 20230411
Copyright (C) 2000-2009 Josh Coalson, 2011-2022 Xiph.Org Foundation
flac comes with ABSOLUTELY NO WARRANTY. This is free software, and you are
welcome to redistribute it under certain conditions. Type `flac' for details.
1159.flac: done
Performance counter stats for './vbuf10m/src/flac/flac -d /media/u20/Data/1159.flac -o /media/u20/Data/1159.wav -f':
5,875.65 msec task-clock # 0.563 CPUs utilized
56,560 context-switches # 9.626 K/sec
19 cpu-migrations # 3.234 /sec
2,725 page-faults # 463.779 /sec
6,296,540,251 cycles # 1.072 GHz
15,820,642,969 instructions # 2.51 insn per cycle
1,368,024,143 branches # 232.830 M/sec
7,776,357 branch-misses # 0.57% of all branches
10.443608953 seconds time elapsed
5.085768000 seconds user
1.077939000 seconds sys
default build
u20@u20:~/Documents/mountain/runtime/flac$ perf stat ./main_build/src/flac/flac -d /media/u20/Data/1159.flac -o /media/u20/Data/1159.wav -f
flac git-4b2c33eb 20230410
Copyright (C) 2000-2009 Josh Coalson, 2011-2022 Xiph.Org Foundation
flac comes with ABSOLUTELY NO WARRANTY. This is free software, and you are
welcome to redistribute it under certain conditions. Type `flac' for details.
1159.flac: done
Performance counter stats for './main_build/src/flac/flac -d /media/u20/Data/1159.flac -o /media/u20/Data/1159.wav -f':
6,141.41 msec task-clock # 0.540 CPUs utilized
65,968 context-switches # 10.742 K/sec
30 cpu-migrations # 4.885 /sec
167 page-faults # 27.192 /sec
6,553,218,206 cycles # 1.067 GHz
15,961,012,058 instructions # 2.44 insn per cycle
1,399,294,282 branches # 227.846 M/sec
14,624,563 branch-misses # 1.05% of all branches
11.373024667 seconds time elapsed
5.139417000 seconds user
1.394814000 seconds sys
u20@u20:~/Documents/mountain/runtime/flac$ perf stat ./main_build/src/flac/flac -d /media/u20/Data/1159.flac -o /media/u20/Data/1159.wav -f
flac git-4b2c33eb 20230410
Copyright (C) 2000-2009 Josh Coalson, 2011-2022 Xiph.Org Foundation
flac comes with ABSOLUTELY NO WARRANTY. This is free software, and you are
welcome to redistribute it under certain conditions. Type `flac' for details.
1159.flac: done
Performance counter stats for './main_build/src/flac/flac -d /media/u20/Data/1159.flac -o /media/u20/Data/1159.wav -f':
6,075.46 msec task-clock # 0.542 CPUs utilized
65,971 context-switches # 10.859 K/sec
43 cpu-migrations # 7.078 /sec
165 page-faults # 27.158 /sec
6,483,611,422 cycles # 1.067 GHz
15,961,729,133 instructions # 2.46 insn per cycle
1,399,494,698 branches # 230.352 M/sec
12,424,063 branch-misses # 0.89% of all branches
11.202770781 seconds time elapsed
5.112267000 seconds user
1.340339000 seconds sys
strace confirms that the default does orders of magnitude more write syscalls.
10m:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
69.93 0.116078 866 134 write
21.70 0.036026 4503 8 openat
6.93 0.011505 0 37357 read
0.74 0.001229 0 4716 clock_gettime
0.67 0.001115 0 4712 lseek
0.02 0.000026 1 19 mmap
0.00 0.000006 0 7 fstat
0.00 0.000004 1 4 brk
0.00 0.000000 0 8 close
0.00 0.000000 0 1 stat
0.00 0.000000 0 6 mprotect
0.00 0.000000 0 1 munmap
0.00 0.000000 0 1 ioctl
0.00 0.000000 0 6 pread64
0.00 0.000000 0 1 1 access
0.00 0.000000 0 1 execve
0.00 0.000000 0 1 chmod
0.00 0.000000 0 2 1 arch_prctl
0.00 0.000000 0 1 utimensat
------ ----------- ----------- --------- --------- ----------------
100.00 0.165989 46986 2 total
default:
u20@u20:~/Documents/mountain/runtime/flac$ strace -c ./main_build/src/flac/flac -d /media/u20/Data/1159.flac -o /media/u20/Data/1159.wav -f
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
50.13 0.055369 5 9540 write
23.28 0.025717 3214 8 openat
22.05 0.024356 0 37357 read
2.38 0.002626 0 4717 clock_gettime
2.07 0.002289 0 4712 lseek
0.03 0.000028 1 18 mmap
0.03 0.000028 7 4 brk
0.02 0.000024 3 8 fstat
0.01 0.000013 1 8 close
0.00 0.000004 4 1 ioctl
0.00 0.000000 0 1 stat
0.00 0.000000 0 6 mprotect
0.00 0.000000 0 1 munmap
0.00 0.000000 0 6 pread64
0.00 0.000000 0 1 1 access
0.00 0.000000 0 1 execve
0.00 0.000000 0 1 chmod
0.00 0.000000 0 2 1 arch_prctl
0.00 0.000000 0 1 utimensat
------ ----------- ----------- --------- --------- ----------------
100.00 0.110454 56393 2 total
9417 write syscalls vs 23 spent writing the wav file; the rest are for the progress bar. The above only shows the time taken to make the syscalls, not the time spent inside them. strace -T outputs the time taken in each syscall; summing that output gives the following, which I believe is the time taken to actually write the wav file (possibly with or without the time to make the syscall from the first part of this post):
         Seconds
default  2.275357
10m      1.621201
What's curious is that each frame is written in two write calls for the default build, here's the start of the write calls for the default build:
write(3, "RIFF\366\320\312\rWAVEfmt (\0\0\0\376\377\2\0\0\356\2\0\0\224\21\0"..., 4096) = 4096 <0.000080>
write(3, "\377\263\374\377:\0\0|\1\0\0\1\0\373\3\0\241\375\377\244\376\377w\1\0\260\3\0\333\375\377\31"..., 45056) = 45056 <0.000307>
write(3, "\204\377\330\r\0{\215\377z5\0\325\254\377)\3\08\214\377\355\t\0\263\236\377\315\362\377\327o\377"..., 4096) = 4096 <0.000053>
write(3, "\0\362\30\0TR\0\362H\0O\4\0#\377\377i\10\0\366\24\0\266\27\0k4\0\2404\0\211"..., 45056) = 45056 <0.000295>
write(3, "\376\1\377B\377\361\34\2t\35\377L\23\0023O\377\tF\2wX\377\0010\2\345y\377\242/\2"..., 4096) = 4096 <0.000047>
write(3,
Which presumably is because the header partially fills the buffer, then each frame in this example file wants to write 49152 bytes. At each frame the rest of the buffer is filled and written, then the remaining multiple of 4096 is written at once, then the buffer is partially filled to be written with the next frame.
It looked like a default build that also flushed the header might halve the number of write syscalls when a small buffer aligns neatly (like decoding to a simple format like wav with a typical blocksize of 4096, very common), but it didn't:
write(3, "RIFF\366\320\312\rWAVEfmt (\0\0\0\376\377\2\0\0\356\2\0\0\224\21\0"..., 68) = 68 <0.000071>
write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\3\0\0\2\0\0\373\377\377\374\377\377\1\0"..., 4096) = 4096 <0.000211>
write(3, "\375\377\231\376\377\356\0\0\254\372\377-\376\377\307\0\0\214\1\09\377\377\207\0\0\271\2\0\252\0\0"..., 45056) = 45056 <0.000466>
write(3, "z\16\0\257\222\377\274\21\0\300s\377v1\0\334\241\377N\363\377.\177\3774\31\0\265\220\377\345!"..., 4096) = 4096 <0.000072>
write(3, "0\0005!\0\233\16\0G0\0\17+\0\324\3\0\330\35\0T\325\377\34'\0_\1\0jq\0"..., 45056) = 45056 <0.000433>
Still doing 4096+45056 instead of a single write of 49152. Double checked and there was nothing else written after the header to misalign the buffer, so it looks like (at least in gcc and clang's implementations) the buffer is filled even if it starts empty, then the remainder (maybe in full, or the largest multiple of the buffer size) is written in a single call. Unfortunate for the edge case, but the solution that works everywhere is to have a buffer that's at least as big as the maximum expected raw data contained in a frame.
The doubling of write syscalls is an inefficiency when the buffer is too small, but it might not be the whole story. It might not even be a major part of why it's slower, but it would be convenient if it were.
With gcc and presumably clang the setvbuf buffer size is ignored if buffer is NULL. This is unlike Windows, which resizes an internal buffer (which is how flac currently does it).
The current state of https://github.com/chocolate42/flac/tree/vbuf implements the external buffer properly instead of hacking it in, and moves _WIN32 builds to using an external buffer instead of internal, so everything uses a 10MiB external buffer.
Couldn't implement external buffer for stdout, because stdout is not closed. A program using stdout will have to use setvbuf on stdout if they want to (which they might, there's a benefit at least when piping to file).
For _WIN32, other than using an external buffer, the only other difference is that stdout is determined with `file != stdout` instead of using `if(GetFileType((HANDLE)_get_osfhandle(_fileno(file))) == FILE_TYPE_DISK)`. This seems fine? It's good enough elsewhere in the code; in the case where the file is a non-stdout pipe, the code elsewhere probably breaks anyway (freeing a pipe that may be re-used). Maybe I'm misunderstanding, but I don't think arbitrary pipes are currently supported; supporting them would involve overhauling everywhere that checks for stdin/stdout instead of an arbitrary pipe.
What remains to be solved to potentially turn this into a PR:
* The `#if 0`'s in the latest commit are changed to `#ifdef _WIN32` and the diverging paths remain. Merging the paths would be ideal because that's a mess.
* Determine which targets a large buffer cannot be used with and exclude them.
I haven't looked into your changes in depth, but I would consider moving the flac command line tool from using FLAC__stream_encoder_init_file() to FLAC__stream_encoder_init_FILE(). That way you can implement the whole buffer thing in flac and leave libFLAC untouched, which massively reduces the number of extreme cases you're dealing with. Especially embedded applications tend to use libFLAC but not the command line tools.
That makes a lot of sense.
Measuring I/O is apparently hard; the latest results make it questionable when and where the apparent gains are to be had. These are the old results to compare against:
         HDD wall time sum
default  5918.10
10m8     4508.47
After trying some Linux kernel tweaks to disable/flush the cache in pursuit of more consistent results, these new results happened (no defrag):
         HDD wall time sum
default  3191.47
10m      3271.49
The default is the exact same build for both tests. Not only are the results suspiciously much better and the two builds within the margin of error of each other, the results are also much less consistent. As mentioned previously the old test was pretty consistent; this test had wild swings in both directions.
My best guess as to what's happening is that the new measured time is junk. fflush on Linux doesn't guarantee that the data is written to disk, only that the buffer is emptied and that it's up to the kernel to finish things. Whatever stuck from the kernel tweaks I recklessly tried (and can't seem to reverse) seems to have resulted in the kernel returning from writes quicker than it did before, possibly very quickly. The kernel's internal buffer doesn't seem to have enough time to empty across decodes; when it builds up too much it seems to fully empty, at which point whatever test was running gets a severe wall-time penalty. If that's what's actually happening then the only writes reflected in the new data are the ones where the kernel fully dumps its buffer.
The next thing to try is fsync on fclose, which should push the data in the kernel buffer to the HDD and should make the results consistent again. Even fsync doesn't guarantee that the data is fully written, only that the disk has it (so SSDs have it at least in their RAM), but this is an old HDD with limited cache so it should be good enough to test.
Disregard any prior benchmarks or conclusions drawn, I vastly misunderstood how aggressively and when laptops put themselves in different states so it's all tainted. The following results try to control as many variables as is feasible (laptop plugged in fully charged, no UI, no other access to the disk, screen off, no other programs open, cooldown between runs). Fairly confident that these results are as accurate as they can be.
                               default  10m_fsync  10m_nofsync
script time                    3603.91  3109.06    2651.81
flac time summed               3574.27  3081.07    2626.36
Average                        1.62985  1.40496    1.19761
Standard deviation             1.80209  0.92650    1.34400
Normalised standard deviation  1.00000  0.59642    1.01498
Will eventually test on other hardware and redo the buffer size test to see if there's potentially a more appropriate size. vbuftest.ods
Tried the same decode test using a fast M.2 ssd with a Zen2 4700u APU. Had to rebuild everything, these use gcc 12.2 and clang 11.1, the previous test used gcc 9.4 but the difference is probably minimal.
               default  default  10m_fsync  10m_fsync  10m_nofsync  10m_nofsync
               gcc      clang    gcc        clang      gcc          clang
script time    662.01   676.75   719.42     721.72     671.72       689.68
flac time sum  648.46   663.42   705.81     708.03     658.25       676.26
average        0.2957   0.3025   0.3218     0.3229     0.3002       0.3084
st.dev         0.2078   0.2254   0.2200     0.2314     0.2199       0.2247
The disk is fast enough that buffer size is not relevant using a single core. Parallelising with parallel only utilised an estimated ~330% of 8 cores and cut the time to ~210-225 seconds, still not enough for the default build to be noticeably inadequate using these tests. Processing 60GB of flac into 98GB of wav in under 4 minutes is neat.
Slow SATA SSDs are common and sit between HDDs and fast SSDs in specs. I'm guessing the default build is adequate there under normal conditions (stronger compression, single core); maybe slow SSDs feel the pressure when fully threaded.
Tried a thumb drive with the Zen2; not the slowest, with ~15MB/s writes, but not quick. Used roughly the first third of the corpus, as the full corpus doesn't fit on the drive. Fresh format, no fragmentation.
               default_gcc  10m_fsync_gcc  10m_nofsync_gcc  default_clang  64k_nofsync_gcc
script time    3008.33      3014.97        2969.59          2976.72        2949.04
flac time sum  2779.32      2993.86        2723.95          2695.7         2717.38
So it looks like the benefit measured before is from fragmentation and not an OS difference, same reason init_file sets the 10m buffer on windows for encode. Don't fancy trying to defrag an ntfs partition on Linux to find out for sure, sounds like a recipe for data loss.
Curiously it looks like the data is fsynced at the end of the script; I don't know what else explains the script times all aligning but not the sums. Nothing in the script explains the large discrepancy between script time and flac sum, it's just a loop that calls flac for every file.
The current state of the vbuf branch does as suggested, flac uses init_FILE and all changes are contained to encode.c/decode.c. WIN32 also uses an external buffer, it still needs to be tested on windows to make sure encode speed remains the same and that decode speed has improved in the presence of fragmentation. I can probably check that win32 works but not benchmark from a VM as the VM would translate to Linux calls and we want to benchmark the windows calls.
I can test on Windows.
Something that might be relevant: what filesystem did you test on? You mentioned ntfs once, are all tests on ntfs?
Thanks. The laptop hdd is ntfs with the OS on a primary SSD. The fast SSD is ext4 with OS on the same partition. The USB stick is FAT32.
So, could you summarize what you're seeing?
What hardware were you using for the numbers in the post where you said to disregard all earlier measurements? Was that an HDD?
The current state of your vbuf branch, that is 'nofsync' I presume?
The results from the disregard post were the HDD on laptop with i7-6700hq. The post after that tested the fast ssd in a miniPC host containing Ryzen 4700u. The post after that tested the USB stick on miniPC host. The current state of vbuf is 10m_nofsync, just a 10MiB external output buffer everywhere.
tl;dr I'm seeing a big speedup when writing to a presumably fragmented NTFS HDD, a set of flac that takes an hour to decode to wav on the default build takes 45 minutes with a 10MiB buffer. All other hardware hasn't shown a significant difference.
I've tried to replicate. Here are results for running on Ubuntu 22.04, Intel Celeron N5105, Seagate Barracuda 2.5 5400. ext4 partition.
$ for I in {1..20}; do for EXE in flac-currentgit flac-vbuf; do echo -n $EXE; rm /home/martijn/bin/temp/*; /usr/bin/time -f ": %e" ~/bin/$EXE -s */*.flac */*/*.flac --output-prefix=/home/martijn/bin/temp/ -0; done; done
flac-currentgit: 1042.88
flac-vbuf: 1109.46
flac-currentgit: 1092.47
flac-vbuf: 1058.09
flac-currentgit: 1101.90
flac-vbuf: 1079.19
flac-currentgit: 1093.96
flac-vbuf: 1102.59
flac-currentgit: 1081.33
flac-vbuf: 1170.21
flac-currentgit: 1190.34
flac-vbuf: 1245.58
flac-currentgit: 1162.68
flac-vbuf: 1418.31
flac-currentgit: 1646.74
flac-vbuf: 1851.07
flac-currentgit: 1091.20
flac-vbuf: 1059.79
flac-currentgit: 1281.78
flac-vbuf: 1207.90
That gives me the following numbers
         currentgit  vbuf
average  1178.53     1230.22
stddev   178.32      244.39
stderr   56.39       77.28
Looking at the averages and standard error, I cannot conclude which one is faster, but I think it is safe to say any possible difference gets drowned in the noise anyway.
I do have a theory as to why adding a buffer might make encoding slower: with a small buffer, the OS can start writing right away while buffering a bit. However, when the program itself implements a buffer, the OS cannot start sending data to the hard disk until that buffer is full.
Results on windows will follow later.
Thank you for trying to confirm.
> I do have a theory as to why adding a buffer might make encoding slower: having a small buffer, the OS can start writing right away while buffering a bit. However, when the program itself implements a buffer, the OS cannot start sending data to the harddisk until that buffer is full.
That makes sense if the OS is fully writing one flac file before letting the next flac process run, aka there is an implicit fsync so there's time at the start of every file where there is no write in progress.
What I'm seeing I've interpreted as the next file being processed while the previous one is still being written by the kernel: the previous file has left the fwrite buffer thanks to the implicit fflush from fclose, but is still in the kernel buffer being written to disk after the process that made it has exited. There's wildly variable behaviour of the default and 10m_nofsync builds on the HDD; the same results as before but highlighted for clarity:
What I've attributed this to is the kernel buffer getting full and the flac process having to wait for the kernel buffer to empty enough during fwrite. My interpretation may be entirely wrong, pure guess. My OS is nothing special AFAIK, Ubuntu 20.04:
uname -a
Linux u20 5.15.0-67-generic #74~20.04.1-Ubuntu SMP Wed Feb 22 14:52:34 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Sorry for the mess; no matter what the reason is, it seems to be peculiar behaviour. I'll try running the USB on the laptop to see if the behaviour matches the USB result from the miniPC or the HDD result from the laptop. It should match the miniPC result, but if it matches the HDD result then there's definitely something going on with the laptop OS.
The USB result behaves the same on the Ubuntu 20.04 laptop as on the miniPC, mildly supporting that the HDD result is not caused by OS weirdness on slow media and is likely a result of fragmentation.
Reran the HDD test on the laptop using a live image of Fedora 38 and rebuilt with gcc 13. Did a run of default then 10m, rebooted, then a run of 10m then default to rule out caching:
               default_run0  10m_run0  10m_run1  default_run1
flac sum time  2887.02       2089.43   2141.82   2902.73
script time    2919.42       2114.49   2170.31   2932.13
This rules out any weirdness that may have accumulated on the daily-driver Ubuntu 20.04 that's been in service for 3 years, or from old toolchains. Admittedly it's the same hardware, but two OSes have confirmed that the 10m build performs better when using a (very likely) fragmented HDD.
How much space is left on that device when running the tests? If that is more than 20%, fragmentation should not be too much of an issue. On the other hand, the Linux NTFS handling might not be as optimized as on Windows, let alone as the ext4 handling.
It's 98.1% full, 18.5GB free out of ~1TB. Linux NTFS handling certainly leaves a lot to be desired, or it did at least. My Ubuntu 20.04 uses ntfs-3g, which is the old way of handling NTFS (feature-incomplete but mature, implemented via FUSE, a userspace solution which incurs extra context-switching overhead at the very least compared to a kernel implementation). A much better replacement made its way into the kernel last year (ntfs3) but it looks like it stalled from world events so is still WIP; Fedora 38 probably hasn't enabled it by default. It's possible that the additional ntfs-3g context switching is the bottleneck. I doubt it, but will have to figure out if the ntfs3 driver is the default on the Fedora live image and, if not, redo the test with it to see if there's a difference.
Even ext4 is seeing improvements still ( https://www.phoronix.com/news/Linux-6.3-EXT4 ). Tangentially related, io_uring is an exciting new async IO interface for Linux, as much as IO interfaces can be exciting. It's unlikely that it's that much of a benefit to flac's synchronous access pattern but there are likely at least minor gains from less system call overhead.
Checked the live image and it turns out Fedora 38 does use ntfs3, so the Fedora result above rules out an unoptimised ntfs-3g as an issue as ntfs3 shows similar gains with the 10m build.
Note the HDD results between OSes are not directly comparable; AFAIK there's nothing that can be read from the Ubuntu runs taking longer than the Fedora runs. To control for laptop power state shenanigans I had the Ubuntu runs wait 10 minutes to get into a stable (not ramping due to UI) lower power state. The Fedora live image didn't seem to have any adaptive power behaviour; presumably the live image doesn't enable any of that for a snappier live impression. There are also probably different schedulers in use and all sorts of different settings.
I've finally taken some time to test this under Windows, and I must say, under the right conditions the difference is horrific, truly.
I've compared three compiles: one with the vbuf tree of your repository, one with that same tree but rolled back a few commits, and one rolled back with the setvbuf in stream_encoder.c (which has been there for many years) removed. Respectively they're called flac-extravbuf.exe, flac-nochange.exe and flac-novbuf.exe.
I've taken an old 16GB no-name exFAT USB-stick and filled it for 98%. In the remaining 2% testing was done, first copying a WAV file there, then compressing it to a FLAC, then decompressing it to a WAV. These 3 files are then removed, making space for the next round.
Differences are night and day, for encoding as well as decoding.
Tested presets are 0, 1, 2, 3 and 4. It seems encoding gets 10x as fast, and decoding gets 4-5x as fast. I think the reason 3 and 4 decode so much faster is the larger blocksize (hence larger writes); 4096×4 is probably a much better write size than 1152×4.
So, clearly, I'd say this is an improvement. Even if it gives a slight slowdown in situations where fragmentation is low (which I can't say for sure) the advantages far outweigh the (possible) disadvantages.
I'll do some testing on using smaller buffer sizes; perhaps something like 128KiB already improves a lot.
I'm wondering though, should the setvbuf be removed from libFLAC, in effect moving it to the flac command line tool? Or should we only add it to the flac command line tool for non-WIN32 encoding? Or should we duplicate it, like the current state of the repo does? Maybe calling setvbuf twice on the same file doesn't really do anything?
Here are some more detailed results
I'd say using a 32KiB buffer already gives a nice performance improvement. I'm not sure whether it makes much sense, but I think 10MiB is a bit wasteful. Then again, that is what's been in FLAC for years now on Windows.
I'll do some more testing, trying to get the same result with a heavily fragmented ext4 partition or something like that.
Some more results. Took a 32GB NTFS partition on a 250GB SSD (several years old) and filled it for 99%. Then ran the tests on Windows 10. set of 16bit tracks vbuf SSD.pdf
X-axis is different here, because logarithmic scale doesn't show any labels. Difference is small but consistent. I have no clue why 32KiB and 10MiB buffer would be slower here than a 128KiB and 512KiB buffer on decoding.
Even more results. I tried to test something a little more realistic, so took a 1TB HDD, filled it for 95% and ran tests. The results are opposite of the previous few: adding a buffer slows down encoding by about 20% for some reason, while a buffer of 128KiB or 512KiB speeds up decoding by about 40%. These results don't seem to make sense, but they are reproducible, I've ran the tests several times.
Also, running of the binaries is interleaved: 10 tracks are tested, and each track gets tested with each binary before moving on to the next track.
Sorry I just noticed your latest posts.
> These results don't seem to make sense
In a way I'm glad not to be the only one mystified by benchmarking I/O.
> I'd say using a 32KiB buffer already gives a nice performance improvement. I'm not sure whether it makes much sense, but I think 10MiB is a bit wasteful. Then again, that is what's been in FLAC for years now on Windows.
I came to the conclusion that 64KiB already showed a benefit but sizes in the MiB range helped a bit more; however that was during some unreliable benchmarking, so it's inconclusive. Research indicated 8MiB could be beneficial and that there was little benefit to going beyond 16MiB, but I didn't save the link.
> I'm wondering though, should the setvbuf be removed from libFLAC, in effect moving it to the flac command line tool?
IMO control of how a stream is handled should stay with whatever opened the stream, i.e. libFLAC when init_file is used or the external program when init_FILE is used. Quoting a manpage:
> The setvbuf() function may only be used after opening a stream and before any other operations have been performed on it.
We don't know the history of the stream so can't guarantee that no operations have been performed. Similarly, that's why I suggested elsewhere to only fclose in libFLAC if libFLAC was the one to fopen (for all we know the stream could be an archive or a chained Ogg file that the external program doesn't necessarily want closed).
> Even if it gives a slight slowdown in situations where fragmentation is low (which I can't say for sure) the advantages far outweigh the (possible) disadvantages.
If you mean a slowdown because a larger buffer takes longer for the initial flush, IMO that's not really relevant and not really a slowdown. When I/O pressure is low (the kernel+hardware buffer is not saturated, i.e. fast hardware and/or small data) it just delays the actual write slightly and shouldn't alter ./flac's execution time. When I/O pressure is high (the kernel+hardware buffer is saturated so an fwrite has to wait during runtime, i.e. slow hardware and/or a lot of data, like processing an entire collection) there are bigger factors at play than a millisecond delay on the first flush.
I've repeated the last results, but I've added an executable that not only has a vbuf on the output file but also a vbuf on the input file. That seems to alleviate the problem.
That reminds me of hacks in https://github.com/onion108/cplusplus-output-benchmark/issues/1
This ( https://hydrogenaud.io/index.php/topic,123889.msg1024853.html#msg1024853 ) post prompted me to poke around fwrite and hack in setting the fwrite output buffer size for non-win32 targets: https://github.com/chocolate42/flac/tree/vbuf
Tested by re-encoding a ~150MB 24 bit flac file to -0 to the same drive, on a 2.5" HDD and a slowish SSD on a skylake laptop. Tried a buffer of 10MB to match the win32 path, and a buffer of 10MB+8 based on the recommendations in this post: https://www.enterprisestorageforum.com/hardware/a-trip-down-the-data-path-i-o-and-performance/
These are the average wall times of 10 runs each in seconds on Linux 64 bit:
Didn't make a proper PR directly because there are some open questions.