xiaoxichen / leveldb

Automatically exported from code.google.com/p/leveldb
BSD 3-Clause "New" or "Revised" License

Possible bug: fsync() required after calling rename() #189

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Similar to issue 187, this bug is about what happens during a power-failure.

Also, this bug is not actually triggered atop file-systems that I usually use 
(ext3/ext4), so I haven't tried reproducing it on a real system.

The bug: When LevelDB opens a database, it compacts the current log files to a 
".sst" file, creates a new MANIFEST, and updates the CURRENT file. Updating the 
CURRENT file is done with the usual atomic technique using rename(). It then 
deletes the old log files and old MANIFEST file. 

In this entire sequence of operations, LevelDB does not explicitly ensure that 
the rename() (of the CURRENT file) actually reaches disk before the unlink()s 
(of the old MANIFEST and log files) do. If the unlink()s are persisted before 
the rename(), the database can be left corrupted (due to a missing MANIFEST) or 
wrong (due to a deleted, non-compacted log). I verified both the corruption and 
the wrong behavior by simulating the unlink()s happening before the rename().
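
For concreteness, the open-time sequence looks roughly like this (an 
illustrative sketch, not LevelDB's actual code; the file names are made up):

```cpp
// Illustrative sketch of the open-time sequence described above.
// File names are made up; this is not LevelDB's actual code.
#include <cstdio>   // std::rename, std::remove

void open_time_recovery_sketch() {
  // 1. Compact the old log into a new .sst file and write a fresh
  //    MANIFEST describing it (both assumed already fsync'd).

  // 2. Point CURRENT at the new MANIFEST with the usual atomic trick:
  //    write a temp file, fsync it, then rename it over CURRENT.
  std::rename("dbdir/CURRENT.tmp", "dbdir/CURRENT");

  // Missing step (the subject of this issue): an fsync() of the directory
  // here would make the rename() durable before the unlink()s below, e.g.
  //   int dir = open("dbdir", O_RDONLY); fsync(dir); close(dir);

  // 3. Delete the now-obsolete files. If these reach disk before the
  //    rename() above, a crash can leave the old CURRENT pointing at a
  //    MANIFEST that no longer exists, or drop a log that was never
  //    compacted.
  std::remove("dbdir/MANIFEST-000003");
  std::remove("dbdir/000122.log");
}
```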

That said, on Linux, neither ext3 nor ext4 ever "re-orders" calls like 
unlink() or rename(). Btrfs comes close to affecting LevelDB by doing such 
re-ordering, but it seems to only re-order in the direction of pushing 
rename()s before unlink()s, and so seems to do no real harm.

Nevertheless, LevelDB should not depend on the unlink()s happening after the 
rename(). There are lots of operating systems and file-systems out there.

Original issue reported on code.google.com by madthanu@gmail.com on 17 Jul 2013 at 3:34

GoogleCodeExporter commented 9 years ago
I'm not sure this is a good one to fix. I could see the case for fsync'ing the 
newly created .sst file, MANIFEST & CURRENT files to ensure the data is not 
lost, but isn't any journaling filesystem going to be required to create a 
separate transaction for the unlink() following a rename()? Forcing the fsync 
on the dir in the middle there just seems like extra pain, particularly on a 
filesystem with a write barrier. Am I missing something?

Original comment by cbsm...@gmail.com on 30 Sep 2013 at 9:36

GoogleCodeExporter commented 9 years ago
Quick question: Is performance the "pain" you talk about? Since you do a couple 
of fsync calls while opening the database anyway, this would probably cost you 
a maximum of 30 ms with a hard drive ...

My two cents: ext3/ext4 always persist the rename() before (or along with) the 
unlink(), and I don't think that's going to change in the future. Being a 
filesystem student myself though, I disagree with "any journaling filesystem 
going to be required" - I can imagine (in my crazy head) delaying renames the 
same way people did delayed allocations. It wouldn't serve any real purpose I 
can think of, however. (Maybe  somebody decides that immediately sending an 
unlink to disk will help security? or maybe to get more space? I don't know). 
I'm more worried about COW filesystems like Btrfs.

Original comment by madthanu@gmail.com on 30 Sep 2013 at 10:49

GoogleCodeExporter commented 9 years ago
I can see delaying when these are fully reflected in the filesystem, but don't 
you have to commit them to the journal, ensuring that *eventually* the rename 
will be reflected on disk? Doesn't that force perceived ordering at the VFS 
level?

The performance pain I'm thinking of is the additional write barrier, which can 
potentially be pretty nasty, particularly if the system has a lot of entries in 
the directory.

Original comment by cbsm...@gmail.com on 1 Oct 2013 at 7:37

GoogleCodeExporter commented 9 years ago
Eventually on disk - Yes. Perceived ordering at the VFS level - No. The reason 
is, by "delaying", I mean "delaying the rename *past* future unlinks". That is, 
the unlink might be made part of a transaction that gets committed before the 
rename, even though the user performed the unlink after the rename.

Explanation: For example, consider delayed allocation in ext4. When you append 
to a file, new inode pointers (internal filesystem stuff) are created for the 
new data; these pointers are filesystem metadata. ext3 used to happily commit 
this metadata to the journal in order, and everything was fine and dandy, 
except for performance. ext4 came along and delayed this pointer metadata 
*past* other filesystem operations, for performance. Everything went crazy when 
it was delayed past renames, because everybody was using "write(); rename();" 
for atomicity. (There are other parts to the story, but that's for another 
day.) However, delayed appends still mean the newly appended data will 
eventually end up on disk; just not before the renames.
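
For reference, the "write(); rename();" idiom being discussed is roughly the 
following (a generic POSIX sketch, not LevelDB code); the fsync() of the temp 
file is the step many applications skipped, which is what delayed allocation 
punished:

```cpp
// Generic POSIX sketch of the "write(); rename();" atomic-replace idiom.
// Not LevelDB code; names are made up for illustration.
#include <cstdio>    // std::rename
#include <fcntl.h>   // open
#include <unistd.h>  // write, fsync, close

bool atomic_replace(const char* path, const char* tmp_path,
                    const char* data, size_t len) {
  int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
  if (fd < 0) return false;
  bool ok = write(fd, data, len) == static_cast<ssize_t>(len) &&
            fsync(fd) == 0;   // make the *data* durable before the rename
  close(fd);
  // Only then atomically swap the new file into place under the real name.
  return ok && std::rename(tmp_path, path) == 0;
}
```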

Again, "delaying renames past unlinks" doesn't apply to the current ext3/ext4 
filesystems - the perceived ordering at the VFS layer is perfectly preserved in 
those filesystems. And, ext3/ext4 will need to do *additional* stuff to 
implement delayed-renames; that makes sense only if there is a performance (or 
feature) advantage, as there was in delayed allocations.

That said, I think that when the btrfs guys fixed the filesystem to make 
"write(); rename();" atomic (without an fsync), they made unlinks get delayed 
past renames as a side effect. (That's the opposite of what we are worried 
about.) So, as far as I can see, filesystem developers don't find anything 
wrong with reordering renames and unlinks. There could be some filesystem out 
there where a similar side effect harms us, while it is doing some crazy 
performance optimization.

Performance: There are two concerns here.

In ext3, if you do "sleep(5); write(1 byte); fsync();", that fsync could take 
a really long time. But if you do "sleep(5); fsync(); write(1 byte); 
fsync();", the *second* fsync will only take around 30 ms at most. The first 
fsync will take long because of the well-known "ext3 sends *everything* to 
disk when you fsync only one file" problem. By the second fsync, most of that 
"everything" has already been sent to disk the first time, so it is fast. ext3 
is the worst-performing filesystem I know of in this respect, so all other 
filesystems should perform better.
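
Spelled out, the two sequences being compared look like this (an illustrative 
sketch; the numbers are rough hard-drive estimates):

```cpp
// The two call sequences being compared, spelled out. Illustrative only.
#include <unistd.h>  // sleep, write, fsync

void slow_pattern(int fd) {
  sleep(5);                // meanwhile, other activity dirties lots of data
  (void)write(fd, "x", 1);
  fsync(fd);               // ext3 flushes *everything* dirty: can be very slow
}

void fast_pattern(int fd) {
  sleep(5);
  fsync(fd);               // the first fsync absorbs the big flush
  (void)write(fd, "x", 1);
  fsync(fd);               // only the 1-byte write is left: ~30 ms at most
}
```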

About the actual directory being large - I guess the only thing that ends up 
on disk after you flush the directory is whatever changes there are to the 
directory entries. So, unless we have a thousand files created, deleted, and 
renamed, it wouldn't be a big deal (unless the filesystem does something else 
additionally when fsyncing a directory). LevelDB creates the directory, so 
it's kinda owned by LevelDB, so we probably wouldn't have a thousand files.

All that said, 30 ms still sucks ... I can see not adding an additional fsync 
for that reason, but it would be nice to create a "LevelDB's assumptions about 
the underlying filesystem" topic somewhere in the documentation.

Original comment by madthanu@gmail.com on 1 Oct 2013 at 9:19

GoogleCodeExporter commented 9 years ago
PS: I'm only a grad student studying filesystems, so take my replies with a 
pinch of salt. No Ted Ts'o here.

Original comment by madthanu@gmail.com on 1 Oct 2013 at 9:33

GoogleCodeExporter commented 9 years ago
I'll take your filesystem expertise over mine any day.

I can certainly understand how data can be missing from files without some 
kind of explicit call to flush the data to disk (which was my understanding of 
the "O_PONIES" issue). The bad outcome from a write(); rename(); problem was 
*not* a 0-length file. It was a file with a length reflecting the write, but 
which did not actually contain the *data* from the write (on XFS it would be 
filled with 0's; on some other filesystems you might see random data from 
previous files). However, that is a case where you have a race between data & 
metadata.

In the case of a rename() followed by an unlink() though... these are both 
metadata operations. Now, if you don't have metadata journaling in your 
filesystem, all bets are off, but assuming you do... I think the scenario you 
are describing shouldn't be allowed to happen.

Granted, without an fsync() on the directory, you can't be sure that *either* 
the rename() or the unlink() made it to disk, but because they are invoked 
sequentially, I think you can guarantee that if the unlink() made it to the 
journal, so did the preceding rename(). I say this partly because one of the 
complaints I hear parallel-filesystem people make is that POSIX imposes some 
annoying serialization of metadata operations on filesystems, which they wish 
it never did, but also partly because, specifically with rename() and 
unlink(), the filesystem would need to order the operations in its own 
internal bookkeeping just to be sure it unlink'd the right file (how do you 
know which inode a path refers to if you don't serialize operations?).

Now, I can understand why a filesystem might reorder when any number of 
metadata operations are actually completed on disk, but it doesn't make any 
sense (and opens a pretty big can of worms in terms of API contracts) to delay 
recording metadata operations in the journal. The rename() and unlink() might 
actually get committed from the journal in whatever order you can imagine, but 
that is of course of no consequence in terms of metadata loss, because once 
the metadata is committed to the journal, it should not get lost.

Given what I'm saying, are you sure there is a real risk?

Regarding the performance stuff... The fix for this would only be to add an 
fsync() on the directory, so no data would need to be committed. LevelDB does 
kind of "own" the directory, but by its nature it creates a lot of files in 
the directory and periodically creates and destroys them. So there is 
potentially a lot of metadata that hasn't been committed to disk. Not insane 
levels, but still... I'm not sure it is worth belt-and-suspendering this 
aspect of things.
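
For concreteness, the fix under discussion amounts to something like the 
following sketch (assuming POSIX; not a proposed patch):

```cpp
// Sketch of what the proposed fix amounts to: fsync() the database
// directory after the rename() and before the unlink()s.
#include <fcntl.h>   // open
#include <unistd.h>  // fsync, close

bool sync_dir(const char* dir_path) {
  int fd = open(dir_path, O_RDONLY);  // a directory can be opened read-only
  if (fd < 0) return false;
  bool ok = (fsync(fd) == 0);         // flushes the directory's entries,
  close(fd);                          // making the earlier rename() durable
  return ok;
}
```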

Original comment by cbsm...@gmail.com on 5 Oct 2013 at 12:24

GoogleCodeExporter commented 9 years ago
While cbsm's reasoning about why metadata operations will not be re-ordered 
holds true for simple metadata-journaling file systems, I wouldn't assume the 
same thing for most modern file systems. For example, btrfs clearly re-orders 
metadata operations [1]. Also, when ext4 was introduced, the bad outcome from 
a write(); rename(); *was a 0-length file* [2] (and not a garbage-filled file, 
which was the problem when people used ext3-writeback). In general, while it 
is not simple to do so, even "metadata journaling" file systems are trying 
more and more to re-order metadata operations to improve performance. There is 
no requirement in any standard (POSIX or otherwise) to maintain a sequential 
ordering (even for directory operations).

For now, I do not think btrfs, xfs, or zfs re-order operations in the exact 
way that affects LevelDB. I'm certain that ext4 and ext3 do not (for now). 
Future iterations of these file systems probably would. So, I do not know how 
much LevelDB has to look into this issue. However, with the new versions of 
LevelDB, an fsync() is being done on the directory (while creating the 
MANIFEST file) anyway; I guess there will only be an additional 30 ms delay 
when opening the database, even if another fsync() is added to address this 
issue (on an HDD-based desktop). If performance is a concern here, it might 
make sense to remove all those directory fsync()s that are being done now - 
afaik, all *modern* file systems (not ext2) persist the directory entry when 
you do an fdatasync() on the actual file.

Also, I found that this problem magically disappears if you do RepairDB after 
a power failure, and I have always been doing that. I guess RepairDB 
reconstructs the MANIFEST file and such. I'm not sure if this is a 
recommendable strategy, though.
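
For reference, the RepairDB fallback is just the public API call, roughly like 
this (a sketch; repair_then_open is a made-up helper, and whether this is a 
sound recovery strategy is exactly the question):

```cpp
// Sketch of the RepairDB fallback mentioned above, using the public
// LevelDB API.
#include <leveldb/db.h>
#include <string>

bool repair_then_open(const std::string& dbname, leveldb::DB** dbptr) {
  leveldb::Options options;
  // Rebuilds the MANIFEST (and friends) from the files found on disk.
  leveldb::Status s = leveldb::RepairDB(dbname, options);
  if (!s.ok()) return false;
  return leveldb::DB::Open(options, dbname, dbptr).ok();
}
```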

[1] https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg31937.html
[2] http://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-length-file-problem/

Original comment by madthanu@gmail.com on 23 Jul 2014 at 8:40