
LevelDB seg faults on Get operations with certain .ldb corruptions #217

Closed by GoogleCodeExporter 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?

I first saw this bug while testing production code that uses a JNA wrapper, and 
I was able to reproduce it by modifying the TableFile test in 
corruption_test.cc in the following ways: 
1. Alter the corruption algorithm to corrupt bytes in random locations instead 
of just the end of the file. 
2. Instead of corrupting bytes with an XOR with a hard-coded byte, corrupt 
bytes by flipping all of their bits. 
3. Instead of iterating over the DB to confirm corruption, perform a Get for 
all of its keys. 
4. Do the above repeatedly, with different random corruptions, and eventually 
you'll get a segmentation fault (at least on OSX). 
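
For reference, here is a minimal sketch of the kind of corruption routine described in steps 1 and 2. This is my own illustration, not the attached test code; the helper name and the use of <random> are assumptions.

#include <cstdio>
#include <random>
#include <string>

// Hypothetical helper: flip all the bits of one byte at a random offset in
// the given table file. Illustration only; the attached corruption_test.cc
// changes do the equivalent inside the existing test harness.
static void FlipRandomByte(const std::string& fname, std::mt19937& rng) {
  std::FILE* f = std::fopen(fname.c_str(), "r+b");
  if (f == nullptr) return;
  std::fseek(f, 0, SEEK_END);
  long size = std::ftell(f);
  if (size > 0) {
    std::uniform_int_distribution<long> dist(0, size - 1);
    long offset = dist(rng);
    std::fseek(f, offset, SEEK_SET);
    int byte = std::fgetc(f);
    if (byte != EOF) {
      std::fseek(f, offset, SEEK_SET);
      std::fputc(static_cast<unsigned char>(byte) ^ 0xFF, f);  // flip every bit
    }
  }
  std::fclose(f);
}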

What is the expected output? What do you see instead?

I expect Get operations to return corrupted data, and usually they do, but 
occasionally I get a seg fault that kills the whole process. 

What version of the product are you using? On what operating system?
1.14, OSX 10.8.5. 

Please provide any additional information below.

I have attached a modified version of corruption_test.cc that implements the 
changes described above. I also altered test_harness.cc to run the test 100 
times, which is always sufficient to cause the seg fault. And I stripped out 
the other tests from the Makefile. Swap these into the LevelDB project, and you 
should be able to replicate the seg fault with make check. 

Original issue reported on code.google.com by jpflana...@gmail.com on 15 Nov 2013 at 8:03

GoogleCodeExporter commented 9 years ago
Here's what's happening. When the byte(s) at the beginning of a .ldb (or .sst) 
file get corrupted, the call to Block::NumRestarts() can return a crazy value, 
which makes the "mid" variable in Seek take on a crazy value, which is passed 
to GetRestartPoint, and ultimately gets passed to memcpy inside DecodeFixed32 
in coding.h. Kaboom. 
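
For context, here is roughly the code path involved, paraphrased from block.cc and coding.h of that era (not verbatim leveldb source):

// The restart count is decoded from the last 4 bytes of the block:
inline uint32_t Block::NumRestarts() const {
  return DecodeFixed32(data_ + size_ - sizeof(uint32_t));
}

// Seek() binary-searches restart indices in [0, num_restarts), and each
// probe decodes a restart offset from the restart array:
uint32_t Block::Iter::GetRestartPoint(uint32_t index) {
  return DecodeFixed32(data_ + restarts_ + index * sizeof(uint32_t));
}

// On little-endian targets DecodeFixed32 boils down to a 4-byte memcpy from
// that address, so a corrupted restart count produces an index far outside
// the block and the memcpy touches unmapped memory.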

One simple fix is to put a sanity check in the Block::NewIterator function. 

Iterator* Block::NewIterator(const Comparator* cmp) {
  if (size_ < sizeof(uint32_t)) {
    return NewErrorIterator(Status::Corruption("bad block contents"));
  }
  const uint32_t num_restarts = NumRestarts();
  if (num_restarts == 0) {
    return NewEmptyIterator();
  } else if (num_restarts > 4096) {  // check for crazy values; use max block size as ceiling
    return NewEmptyIterator();
  } else {
    return new Iter(cmp, data_, restart_offset_, num_restarts);
  }
}

Original comment by jpflana...@gmail.com on 15 Nov 2013 at 11:29

GoogleCodeExporter commented 9 years ago
[Copied from an earlier email message on 2013/11/19]

The code already contains sanity checks for NumRestarts (the Block constructor 
sets size_ to zero if it sees an obviously bad restart value and the iterator 
constructor returns an error if size_ is zero).

The reason your changes cause a crash in corruption_test.cc is that the 
corruption occurs between the Block constructor and the iterator construction. 
The inserted corruption affects the in-memory contents of the constructed Block 
because on POSIX leveldb uses mmap, so the corrupted bytes show up inside the 
Block.
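
To make the mmap point concrete, here is a small standalone sketch (my own illustration, not leveldb code; the file path is arbitrary) showing that, on Linux and OS X, writes made to a file after it has been mapped with MAP_SHARED are visible through the existing mapping:

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cassert>
#include <cstdio>

int main() {
  int fd = open("/tmp/mmap_demo", O_RDWR | O_CREAT | O_TRUNC, 0644);
  assert(fd >= 0);
  ssize_t n = write(fd, "AAAA", 4);
  assert(n == 4);

  void* addr = mmap(nullptr, 4, PROT_READ, MAP_SHARED, fd, 0);
  assert(addr != MAP_FAILED);
  const char* p = static_cast<const char*>(addr);
  printf("before: %c\n", p[0]);   // prints 'A'

  // "Corrupt" the file after mapping it; the mapping sees the change.
  n = pwrite(fd, "Z", 1, 0);
  assert(n == 1);
  printf("after:  %c\n", p[0]);   // prints 'Z'

  munmap(addr, 4);
  close(fd);
  return 0;
}

This is the situation the modified test creates: the Block keeps pointing at mmap'ed file contents, so flipping bytes in the file flips bytes inside the already-constructed Block.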

We could add another check in the iterator constructor as you suggest, but that 
won't really fix the problem fully. The corruption could just as easily have 
happened after the iterator was constructed. To fully protect against such 
corruptions, we would need to instrument every access into the block contents 
in the fast path of the iterator's operations. That's a pretty big undertaking 
and it will also slow down reads. And it might be very hard to make the code 
bulletproof (think, for example, of the binary search going into an infinite 
loop because a corruption violates the sorted order of keys).

Given that this is a testing-induced failure, I am not sure it is worth the 
complications and slowdowns to deal with this case.

Original comment by san...@google.com on 9 Apr 2014 at 4:43