Possible bug: After a power-loss, LevelDB does not discard partially-flushed transactions

GoogleCodeExporter commented 9 years ago

This (possible) bug concerns what happens after a power loss. It also 
depends on how the database is opened after rebooting: we run RepairDB (with 
paranoid checksums switched on) before trying to retrieve values.

The bug: When a Put() is issued, LevelDB appends to the log file; the appends 
result from write() calls issued by the internal EmitPhysicalRecord() function. 
Suppose a power loss occurs and the last append for the Put() leaves the 
appended portion of the file filled with garbage or zeros (which is possible 
on file systems with writeback journaling). If we then reboot, run RepairDB, 
and try to retrieve values from the database, we get back a corrupted value 
for that Put().
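
As a rough illustration of the behavior I would expect, here is a minimal, 
self-contained sketch (not LevelDB code). It frames a payload with a CRC-32C 
checksum and refuses to return it if the tail was zeroed or truncated. The 
8-byte header layout and the names EncodeRecord, DecodeRecord and ZeroTail are 
assumptions made for the example; LevelDB's real framing uses a 7-byte header 
and a masked CRC.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Bitwise CRC-32C (Castagnoli polynomial, reflected form 0x82F63B78).
// LevelDB additionally masks its CRCs; that step is omitted in this sketch.
inline uint32_t Crc32c(const std::string& data) {
  uint32_t crc = 0xFFFFFFFFu;
  for (unsigned char byte : data) {
    crc ^= byte;
    for (int bit = 0; bit < 8; ++bit) {
      crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
  }
  return crc ^ 0xFFFFFFFFu;
}

// Appends a 32-bit value in little-endian byte order.
inline void AppendU32(std::string* dst, uint32_t v) {
  for (int i = 0; i < 4; ++i) {
    dst->push_back(static_cast<char>((v >> (8 * i)) & 0xFF));
  }
}

// Reads a little-endian 32-bit value starting at pos.
inline uint32_t ReadU32(const std::string& src, size_t pos) {
  uint32_t v = 0;
  for (int i = 0; i < 4; ++i) {
    v |= static_cast<uint32_t>(static_cast<unsigned char>(src[pos + i]))
         << (8 * i);
  }
  return v;
}

// Hypothetical simplified framing: [crc32c:4][length:4][payload].
inline std::string EncodeRecord(const std::string& payload) {
  std::string out;
  AppendU32(&out, Crc32c(payload));
  AppendU32(&out, static_cast<uint32_t>(payload.size()));
  out += payload;
  return out;
}

// Returns true and fills *payload only if the record is complete and its
// checksum matches; a torn or zero-filled tail makes this return false.
inline bool DecodeRecord(const std::string& file, std::string* payload) {
  if (file.size() < 8) return false;             // header missing
  const uint32_t crc = ReadU32(file, 0);
  const uint32_t len = ReadU32(file, 4);
  if (file.size() - 8 < len) return false;       // payload truncated
  const std::string body = file.substr(8, len);
  if (Crc32c(body) != crc) return false;         // payload corrupted
  *payload = body;
  return true;
}

// Simulates a torn append: the last n bytes of the file become zeros.
inline std::string ZeroTail(std::string file, size_t n) {
  if (n > file.size()) n = file.size();
  for (size_t i = file.size() - n; i < file.size(); ++i) file[i] = '\0';
  return file;
}
```

With this framing, DecodeRecord succeeds on an intact record and fails on 
ZeroTail(record, 4), so a reader can skip the partial Put() instead of 
returning its bytes.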

A similar failure occurs if the power loss leaves the file in the following 
state: one of the appends is still buffered in memory, but later appends to 
the same file have already been flushed to disk. The region corresponding to 
the buffered append can then be filled with zeros or garbage. This does not 
typically happen, but it can (with a writeback file system) if the buffer 
cache and the disk cache behave atypically.
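
A sketch of how a reader could handle this second case: if any fragment of a 
multi-part record fails its checksum, the whole logical record is dropped, 
even when later fragments are intact. The record type values follow LevelDB's 
log format (FULL, FIRST, MIDDLE, LAST); the PhysicalRecord struct, the 
checksum_ok flag (standing in for a real CRC check) and Reassemble are 
hypothetical names for illustration only.

```cpp
#include <string>
#include <vector>

enum RecordType {
  kFullType = 1,
  kFirstType = 2,
  kMiddleType = 3,
  kLastType = 4
};

struct PhysicalRecord {
  RecordType type;
  std::string payload;
  bool checksum_ok;  // outcome of verifying the stored checksum
};

// Returns only those logical records whose every fragment verified.
inline std::vector<std::string> Reassemble(
    const std::vector<PhysicalRecord>& log) {
  std::vector<std::string> complete;
  std::string scratch;     // fragments collected so far
  bool in_record = false;  // true between a good FIRST and its LAST
  for (const PhysicalRecord& rec : log) {
    if (!rec.checksum_ok) {
      // A hole of zeros or garbage: abandon the record in progress.
      scratch.clear();
      in_record = false;
      continue;
    }
    switch (rec.type) {
      case kFullType:
        scratch.clear();
        in_record = false;
        complete.push_back(rec.payload);
        break;
      case kFirstType:
        scratch = rec.payload;
        in_record = true;
        break;
      case kMiddleType:
        if (in_record) scratch += rec.payload;
        break;
      case kLastType:
        // A LAST without a valid FIRST/MIDDLE chain is discarded.
        if (in_record) complete.push_back(scratch + rec.payload);
        scratch.clear();
        in_record = false;
        break;
    }
  }
  return complete;
}
```

For a log of FIRST "aa" (good), MIDDLE "bb" (bad), LAST "cc" (good), FULL "dd" 
(good), this returns only "dd": the flushed LAST fragment is not enough to 
resurrect the record whose middle was lost.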

If RepairDB is not used, the result of the bug differs from what is described 
above. Also, without RepairDB, there are many power-loss-related situations 
in which LevelDB seems to behave badly. Please let me know if it is not 
necessary (or not good practice) to use RepairDB after every power loss. 

What steps will reproduce the problem?
1. Put the database on a separate partition formatted with ext3 and mounted in 
writeback mode (mount -o data=writeback). No other background process should 
be writing to the file system; this lets us easily simulate the timing 
interleaving necessary for the bug to happen.
2. The EmitPhysicalRecord function in log_writer.cc has a Flush() call on the 
log file (line 94 in version 1.15). Just before that call, add an fdatasync() 
call on the log file. This, too, is to force the required timing interleaving.
3. Insert a 45000-character key-value pair using an asynchronous Put(), and 
then spin in an infinite loop.
4. Wait for 5 seconds, then pull out the power cord (the cord should be 
pulled between the 5th and the 25th second). 
5. After rebooting the machine, re-open the database with paranoid checksums, 
run RepairDB, and try reading the values.
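
The steps above do not say why 45000 characters was chosen. One plausible 
reading is that it makes the log record span more than one 32 KiB block, so 
EmitPhysicalRecord is called more than once for a single Put(). The sketch 
below works out the split, assuming LevelDB's log format constants (32768-byte 
blocks, 7-byte record headers), a log that starts at offset 0, and a payload 
of exactly 45000 bytes (the real WriteBatch encoding adds a few bytes). 
PlanFragments and Fragment are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

const size_t kBlockSize = 32768;  // log block size
const size_t kHeaderSize = 7;     // checksum (4) + length (2) + type (1)

enum RecordType {
  kFullType = 1,
  kFirstType = 2,
  kMiddleType = 3,
  kLastType = 4
};

struct Fragment {
  RecordType type;
  size_t length;  // payload bytes carried by this physical record
};

// Splits a logical record of record_size bytes into physical records,
// starting at block_offset bytes into the current block.
inline std::vector<Fragment> PlanFragments(size_t record_size,
                                           size_t block_offset = 0) {
  std::vector<Fragment> fragments;
  size_t left = record_size;
  bool begin = true;
  do {
    // Too little room for a header: the rest of the block is padding.
    if (kBlockSize - block_offset < kHeaderSize) block_offset = 0;
    const size_t avail = kBlockSize - block_offset - kHeaderSize;
    const size_t n = std::min(left, avail);
    const bool end = (n == left);
    RecordType type;
    if (begin && end) {
      type = kFullType;
    } else if (begin) {
      type = kFirstType;
    } else if (end) {
      type = kLastType;
    } else {
      type = kMiddleType;
    }
    fragments.push_back(Fragment{type, n});
    block_offset += kHeaderSize + n;
    left -= n;
    begin = false;
  } while (left > 0);
  return fragments;
}
```

Under these assumptions, PlanFragments(45000) gives a FIRST fragment of 32761 
bytes followed by a LAST fragment of 12239 bytes, i.e. two separate appends 
that a power loss can tear apart.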

What is the expected output? What do you see instead?
The inserted value, or an empty database, is expected. A corrupted value is 
seen.

What version of the product are you using? On what operating system?
LevelDB 1.15, on Ubuntu 12.04.

Please provide any additional information below.
1. I have not tried to obtain a minimal test case (steps to reproduce the 
problem); let me know if a smaller test case would help.
2. I found this bug by using a tool that lets us simulate various power-loss 
scenarios; let me know if the tool would be useful.
3. In general, it seems as if LevelDB tries to retrieve Put()s that were only 
partially persisted to the log before the power loss, instead of discarding 
them after reboot. I'm guessing this partly from the behavior seen when 
RepairDB is not used after the reboot.

Original issue reported on code.google.com by madthanu@gmail.com on 31 Jul 2014 at 12:30

GoogleCodeExporter commented 9 years ago

After further testing, this possible bug also seems to exist in LevelDB-1.17 
(it is not fixed by 
https://code.google.com/p/leveldb/source/detail?r=269fc6ca9416129248db5ca57050cd5d39d177c8#). 
LevelDB-1.17 does not, however, require RepairDB to be called for simple 
process-crash scenarios. 

Also, if this bug is triggered on LevelDB-1.17 without RepairDB, an error is 
returned instead of a corrupted value (assuming paranoid checksums are 
used).

The same problems also occur when MANIFEST files are appended to during 
compaction (again via EmitPhysicalRecord).

Original comment by madthanu@gmail.com on 8 Aug 2014 at 11:38

xiaoxichen / leveldb

Possible bug: After a power-loss, LevelDB does not discard partially-flushed transactions #245