yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com

[DocDB] Consider replacing fsync with fdatasync on the WAL write path #16598

Open bmatican opened 1 year ago

bmatican commented 1 year ago

Jira Link: DB-5995

Description

As per discussions with @fritshoogland-yugabyte, given that we pre-allocate our WAL files, we might be able to relieve some disk pressure by using fdatasync instead of fsync.

Frits is working on a good test bed for us to validate the win from this. Implementation-wise, this should be relatively easy to pull off and put behind a new gflag. FYI @rthallamko3 @Huqicheng @yusong-yan @ttyusupov


fritshoogland-yugabyte commented 1 year ago

In general, fsync forces the OS to flush the dirty pages for a given inode/file, as well as the filesystem journal. For more info see: https://dev.to/yugabyte/the-anatomy-of-xfs-fsync-4ael

There is an alternative to fsync, "fdatasync", which flushes only the file's dirty pages and skips the journal when that is safe. The journal can be a serialisation point, as the article linked above makes clear. The fdatasync call is safe because, if it detects a file structure change that requires flushing the journal, it performs that flush automatically.

YugabyteDB pre-allocates its WAL files, so the WAL writes that must be persisted for transactions can use fdatasync without requiring a journal write, because there is no inode/file structure change.
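As a rough illustration of that pattern (not YugabyteDB's actual code), the sketch below pre-allocates a file by zero-filling it, then overwrites regions in place and persists each write with fdatasync. The function names `preallocate_segment` and `append_entry` are hypothetical, and zero-filling is only one assumed way to pre-allocate; how the real WAL does it is not stated here.

```python
import os

# os.fdatasync is not available on every platform; fall back to fsync there.
sync_data = getattr(os, "fdatasync", os.fsync)


def preallocate_segment(path, size, chunk=1024 * 1024):
    """Create `path` with `size` bytes of zeros and return an open descriptor.

    Zero-filling is an assumption made for this sketch; it fixes the file
    size and allocates its blocks before any log entry is written.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o600)
    zeros = bytes(chunk)
    offset = 0
    while offset < size:
        offset += os.pwrite(fd, zeros[: min(chunk, size - offset)], offset)
    # One full fsync persists the data plus the final size and allocation.
    os.fsync(fd)
    return fd


def append_entry(fd, offset, payload):
    """Overwrite the pre-allocated region at `offset` and persist the data.

    The write stays inside the existing file size, so the file does not grow.
    Returns the offset at which the next entry would start.
    """
    view = memoryview(payload)
    while view:
        written = os.pwrite(fd, view, offset)
        offset += written
        view = view[written:]
    sync_data(fd)
    return offset
```

Because every entry lands inside the already-sized file, the file length reported by `os.fstat` stays constant across appends, which is the property the argument above relies on.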

PostgreSQL also uses fdatasync and also pre-allocates its WAL files. The fsync options and their implementation were looked at thoroughly in 2018, when Linux was found to not report I/O errors in some cases; that has since been fixed.

I performed a simple test in a VM on my laptop to see the difference between calling fsync() and fdatasync() after a write() call:

[image: violin plot of the three tests described below]

This shows tests for calling fdatasync() (Fsync/fdatasync), fsync() (Fsync/fsync) and no synchronisation call (Fsync/no sync). The violin plot shows the variance and the relationship between the tests.

The fdatasync() test shows a mean of 443 us, the fsync() test a mean of 1,061 us, and the no sync test a mean of 9.8 us.

These tests used a pre-allocated file of 64M and buffered writes of 24k, in a VM on my laptop (!).

The source code for testing this is here: https://github.com/fritshoogland-yugabyte/benchmark
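For readers who want a quick feel for this kind of measurement without the linked benchmark, here is a minimal Python sketch of the same idea. It is not the linked code: the function names, the iteration count, and the zero-fill pre-allocation are assumptions. Only the 64M file size and 24k write size come from the description above. Results depend heavily on the filesystem; if the temporary directory is memory-backed, the sync calls will cost almost nothing.

```python
import os
import statistics
import tempfile
import time

FILE_SIZE = 64 * 1024 * 1024   # 64M pre-allocated file, as in the test above
WRITE_SIZE = 24 * 1024         # 24k buffered write, as in the test above


def no_sync(fd):
    """Baseline: leave the write in the page cache."""


def time_writes(sync_fn, iterations=200, file_size=FILE_SIZE,
                write_size=WRITE_SIZE, directory=None):
    """Return per-write latencies in microseconds for write + sync_fn."""
    payload = b"x" * write_size
    zeros = bytes(1024 * 1024)
    latencies = []
    with tempfile.NamedTemporaryFile(dir=directory) as tmp:
        fd = tmp.fileno()
        # Pre-allocate by zero-filling, then persist size and allocation once.
        for off in range(0, file_size, len(zeros)):
            os.pwrite(fd, zeros[: min(len(zeros), file_size - off)], off)
        os.fsync(fd)
        offset = 0
        for _ in range(iterations):
            if offset + write_size > file_size:
                offset = 0  # wrap around so the file never grows
            start = time.perf_counter()
            os.pwrite(fd, payload, offset)
            sync_fn(fd)
            latencies.append((time.perf_counter() - start) * 1e6)
            offset += write_size
    return latencies


def mean_latencies_us(iterations=200, **kwargs):
    """Mean latency per mode, keyed by mode name."""
    modes = {
        # Fall back to fsync where the platform has no fdatasync.
        "fdatasync": getattr(os, "fdatasync", os.fsync),
        "fsync": os.fsync,
        "no sync": no_sync,
    }
    return {
        name: statistics.mean(time_writes(fn, iterations, **kwargs))
        for name, fn in modes.items()
    }
```

Calling `mean_latencies_us()` returns a dict with one mean per mode; passing `directory=` points the test file at a specific filesystem.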