[DocDB] Consider replacing fsync with fdatasync on the WAL write path

Jira Link: DB-5995

Description

As per discussions with @fritshoogland-yugabyte, given that we pre-allocate our WAL files, we might be able to relieve some disk pressure by using fdatasync instead of fsync:

For any calls during the pre-allocated part of the file, we should see benefits.
For the final call, when we cross the pre-allocated section of the file (and thus will close the file afterwards), fdatasync seems to "promote" to fsync anyway, which is functionally what we want.

Frits is working on a good test bed for us to validate the win from this, but implementation-wise, this should be relatively easy to pull off and put behind a new gflag. FYI @rthallamko3 @Huqicheng @yusong-yan @ttyusupov
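
To make the "behind a new gflag" idea concrete, here is a minimal sketch of a flag-gated sync call. It is written in Python for brevity rather than in DocDB's C++, and everything in it is an assumption for illustration: the flag name `USE_FDATASYNC_FOR_WAL` and the helpers `sync_wal` and `append_and_sync` are hypothetical and do not exist in YugabyteDB.

```python
import os
import tempfile

# Hypothetical flag for illustration; not an existing YugabyteDB gflag.
USE_FDATASYNC_FOR_WAL = True


def sync_wal(fd: int, use_fdatasync: bool = USE_FDATASYNC_FOR_WAL) -> str:
    """Persist pending writes on fd and return the name of the call used."""
    # os.fdatasync is not available on every platform; fall back to fsync.
    if use_fdatasync and hasattr(os, "fdatasync"):
        os.fdatasync(fd)
        return "fdatasync"
    os.fsync(fd)
    return "fsync"


def append_and_sync(fd: int, payload: bytes,
                    use_fdatasync: bool = USE_FDATASYNC_FOR_WAL) -> str:
    """Write one entry at the current offset and make it durable."""
    view = memoryview(payload)
    while view:  # os.write may write fewer bytes than requested
        written = os.write(fd, view)
        view = view[written:]
    return sync_wal(fd, use_fdatasync)


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as f:
        print(append_and_sync(f.fileno(), b"entry-1"))
        print(append_and_sync(f.fileno(), b"entry-2", use_fdatasync=False))
```

Keeping the choice in one small function means the existing fsync behaviour stays available by flipping the flag, which is useful while the win is still being validated.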

Warning: Please confirm that this issue does not contain any sensitive information

[X] I confirm this issue does not contain any sensitive information.

In general, fsync forces the OS to flush the dirty pages for a given inode/file, as well as the filesystem journal. For more info see: https://dev.to/yugabyte/the-anatomy-of-xfs-fsync-4ael

There is an alternative to fsync, "fdatasync", that, when it is safe to do so, flushes only the dirty pages and not the journal. The journal can be a serialisation point, as the article linked above about fsync makes clear. The fdatasync call is safe because, if it detects a file structure change that requires flushing the journal, it performs that flush automatically.

YugabyteDB pre-allocates its WAL files. A WAL write that must be persisted for a transaction can therefore use the fdatasync call without requiring the journal write, because the write causes no inode/file structure change.
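
The point about file structure can be illustrated with a small sketch. It assumes a Unix-like system, uses a 1 MiB file instead of a real WAL segment, and the helper names (`preallocate`, `data_sync`, `sizes_around_writes`) are made up for this example. It shows that a write inside the pre-allocated range leaves the file size unchanged, while a write past the end of that range grows the file, which is the kind of metadata change fdatasync also has to persist.

```python
import os
import tempfile

PREALLOC_BYTES = 1 << 20  # 1 MiB, small on purpose for a quick demo


def preallocate(fd: int, length: int) -> None:
    """Reserve `length` bytes so writes inside that range do not grow the file."""
    try:
        os.posix_fallocate(fd, 0, length)
    except (AttributeError, OSError):
        # Fallback: sets the size, but the file may be sparse.
        os.ftruncate(fd, length)


def data_sync(fd: int) -> None:
    """Call fdatasync where the platform provides it, fsync otherwise."""
    getattr(os, "fdatasync", os.fsync)(fd)


def sizes_around_writes(path: str) -> tuple:
    """Return the file size before, inside, and past the pre-allocated range."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        preallocate(fd, PREALLOC_BYTES)
        size_before = os.fstat(fd).st_size
        os.pwrite(fd, b"x" * 4096, 0)               # inside the pre-allocated range
        data_sync(fd)
        size_inside = os.fstat(fd).st_size
        os.pwrite(fd, b"y" * 4096, PREALLOC_BYTES)  # past the pre-allocated range
        data_sync(fd)
        size_after = os.fstat(fd).st_size
        return size_before, size_inside, size_after
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(sizes_around_writes(os.path.join(tmp, "wal-demo")))
```

The first two sizes are equal and the third is 4096 bytes larger. The sketch only observes the file size; it does not show what the filesystem journal does internally.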

PostgreSQL also uses fdatasync and also pre-allocates its WAL files. The fsync options and implementation were looked at thoroughly in 2018, when Linux was found to not report IO errors in some cases, which has since been fixed.

I performed a simple test in a VM on my laptop to see the difference between calling fsync() and fdatasync() after a write() call:

This shows tests for calling fdatasync() (Fsync/fdatasync), fsync() (Fsync/fsync) and no synchronisation call (Fsync/no sync). The violin plot shows the variance and the relationship between the tests.

The fdatasync() test shows a mean of 443 us, the fsync() test shows a mean of 1,061 us, and the no sync test shows a mean of 9.8 us.

These are tests with a pre-allocated file of 64M using a buffered write of 24k. Note that they were run in a VM on my laptop (!)

The source code for testing this is here: https://github.com/fritshoogland-yugabyte/benchmark
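
For readers who want a quick feel for the measurement without building that repository, the following is a rough Python approximation. It is not the benchmark linked above: only the file size (64M) and write size (24k) come from this comment, while the function names, the iteration count, and the timing approach are assumptions. Absolute numbers will differ by machine, filesystem, and storage.

```python
import os
import statistics
import tempfile
import time

FILE_BYTES = 64 * 1024 * 1024  # pre-allocated file size used in the test
WRITE_BYTES = 24 * 1024        # buffered write size used in the test


def no_sync(fd: int) -> None:
    """Stand-in for the 'no sync' case: do nothing after the write."""


def mean_latency_us(path: str, sync, iterations: int = 50) -> float:
    """Mean latency in microseconds of one write() followed by sync(fd).

    iterations * WRITE_BYTES should stay below FILE_BYTES so that every
    write lands inside the pre-allocated range.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        try:
            os.posix_fallocate(fd, 0, FILE_BYTES)
        except (AttributeError, OSError):
            os.ftruncate(fd, FILE_BYTES)  # fallback; the file may be sparse
        os.fsync(fd)  # settle the allocation before timing starts
        buf = b"\0" * WRITE_BYTES
        samples = []
        for _ in range(iterations):
            start = time.perf_counter()
            os.write(fd, buf)  # simplification: assumes the full buffer is written
            sync(fd)
            samples.append((time.perf_counter() - start) * 1e6)
        return statistics.mean(samples)
    finally:
        os.close(fd)


def run_all(path: str, iterations: int = 50) -> dict:
    """Run the three cases and return their mean latencies in microseconds."""
    fdatasync = getattr(os, "fdatasync", os.fsync)  # not available everywhere
    cases = {"fdatasync": fdatasync, "fsync": os.fsync, "no sync": no_sync}
    return {name: mean_latency_us(path, fn, iterations) for name, fn in cases.items()}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for name, mean_us in run_all(os.path.join(tmp, "bench")).items():
            print(f"{name}: {mean_us:.1f} us")
```

A mean over 50 samples hides the variance that the violin plot shows, so treat this only as a sanity check.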

yugabyte / yugabyte-db

[DocDB] Consider replacing fsync with fdatasync on the WAL write path #16598