kedder closed this issue 5 years ago
I'll try to repro it in a test, but it is pretty difficult: the bug was very hard to trigger, and I suspect it happens only on a very specific set of data.
I noticed that changing the `MAX_WRITE_BUF_SIZE` constant also affects the issue. From what I can tell, the C code tries to keep the buffer size below `MAX_WRITE_BUF_SIZE`, and if the buffer exceeds it, it clears the buffer and starts growing it again. This constant and its handling were removed in later versions (see https://github.com/python/cpython/blob/master/Modules/_pickle.c), so I suspect this bug does not apply to future versions of Python.
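As a rough illustration of that flush-and-reset behavior, here is a sketch based only on the description above, not on the actual `_pickle_33.c` code — the `FlushingWriter` class and the threshold value are made up for illustration; only the `MAX_WRITE_BUF_SIZE` name comes from the source:

```python
import io

MAX_WRITE_BUF_SIZE = 64 * 1024  # illustrative threshold, not the real C value


class FlushingWriter:
    """Hypothetical writer that keeps its buffer below MAX_WRITE_BUF_SIZE."""

    def __init__(self, fileobj):
        self.fileobj = fileobj
        self.buf = bytearray()

    def write(self, data):
        self.buf += data
        # Once the buffer exceeds the limit, flush it out and start over
        if len(self.buf) > MAX_WRITE_BUF_SIZE:
            self.fileobj.write(bytes(self.buf))
            self.buf.clear()

    def flush(self):
        self.fileobj.write(bytes(self.buf))
        self.buf.clear()
```

The point is only that the buffer is periodically cleared and re-grown, which is where a growth-arithmetic bug would get a chance to bite repeatedly.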
We tried to port _pickle_33.c from the newer upstream versions, but it has received too many changes for that to be feasible.
Unfortunately I lost the original traceback that we were seeing. The Python version of the pickle code (which is used if you delete `_pickle.cpython-37m-x86_64-linux-gnu.so`) is not affected by this bug, so that's a temporary workaround.
Other changes that seem to fix the issue (at least for the one particular data set I had at hand):

- Setting `WRITE_BUF_SIZE` to be equal to `MAX_WRITE_BUF_SIZE`, so the buffer doesn't grow.
- Changing the `(self->output_len + n) / 2 * 3` expression to `(self->output_len + n) * 2`, to avoid the corner case where the buffer size does not grow when `self->output_len + n` is equal to 1.

Another detail: this bug produces an error on reading pickles, i.e. this change doesn't fix corrupted pickles that are already stored in ZODB.
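The corner case in the growth expression is plain integer arithmetic; C's integer division truncates just like Python's `//`, so it can be checked directly (a minimal sketch with the C expressions transliterated to Python; the function names are made up):

```python
def buggy_new_size(output_len, n):
    # Mirrors the C expression (self->output_len + n) / 2 * 3;
    # C integer division truncates, like Python's //
    return (output_len + n) // 2 * 3


def fixed_new_size(output_len, n):
    # Mirrors the proposed replacement (self->output_len + n) * 2
    return (output_len + n) * 2


print(buggy_new_size(0, 1))  # 0 -- the buffer size never grows
print(fixed_new_size(0, 1))  # 2
```

With `output_len + n == 1`, the division runs first and truncates to 0, so the multiplication by 3 still yields 0 and the buffer cannot grow; the `* 2` form always produces a strictly larger size.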
I was able to fix one corrupted pickle by carefully inspecting its contents with `pickletools` and fixing one byte in a hex editor.
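For anyone wanting to inspect a pickle the same way: the standard `pickletools` module disassembles a pickle stream into named opcodes with their byte offsets, which is what makes locating the bad byte for a hex editor practical. A minimal sketch using the stdlib `pickle` (the corrupted pickles here of course came from zodbpickle):

```python
import io
import pickle
import pickletools

data = pickle.dumps(["a", "b"], protocol=3)

# Human-readable disassembly: one line per opcode, with its byte offset
out = io.StringIO()
pickletools.dis(data, out=out)
print(out.getvalue())

# genops() yields (opcode, argument, byte position) tuples for
# programmatic checks, e.g. to locate a suspicious opcode's offset
ops = [(op.name, pos) for op, arg, pos in pickletools.genops(data)]
```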
This is great!
At ZODB/issues/271 I reported a problem: after a `zodbupdate` run, during the `--pack` phase, we had reproducible explosions under Python 3.7 that didn't happen on Python 3.6.
This patch seems to fix that, i.e. I can no longer reproduce the explosion. The error I saw also looks quite similar: a single byte was written wrongly somewhere in the middle of a pickle (usually an append instruction was changed to a setitem instruction), which I was likewise able to fix by manually editing the pickle in question. (In my case, though, pickles later in the ZODB had more problems that I wasn't able to work around.)
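A single wrong opcode byte really is enough to make a pickle unreadable. A minimal sketch with the stdlib `pickle` — the byte chosen here (turning a `BINUNICODE` opcode `'X'` into `'Y'`, which is not a valid pickle opcode) is just an illustration, not the actual corruption pattern observed:

```python
import pickle

good = pickle.dumps(["a"] * 3, protocol=3)

# Locate the first BINUNICODE opcode ('X') and overwrite it with 'Y',
# which is not assigned to any opcode in any pickle protocol
pos = good.index(b"X")
bad = bytearray(good)
bad[pos] = ord("Y")

try:
    pickle.loads(bytes(bad))
except pickle.UnpicklingError as exc:
    print(exc)  # e.g. "invalid load key, 'Y'."
```

Fixing such a pickle by hand is then the reverse operation: find the offset of the invalid opcode and restore the original byte.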
Sadly, the only way I can reliably reproduce it is `zodbupdate`-ing a 4.5 GB ZODB from Python 2 to 3, and I have no clue how to even begin to make a reproducible test case from that. :-( The DB is full of data I cannot share, so I cannot give it to somebody with more experience minimising the problem.
Still, I hope this helps support the fix by providing at least some evidence that it does something sensible.
If it helps, I can share a pickle object with the corruption privately to people I trust.
If it is too hard to write a test, at least a change log entry would be welcome.
Funny thing, using some brute force I was able to repro the issue and confirm this PR actually fixes it:
```python
>>> import io
>>> from zodbpickle import pickle
>>> inp = ['a'] * 34991
>>> f = io.BytesIO()
>>> pickle.dump(inp, f)
>>> len(f.getvalue())
70064
>>> f.seek(0)
0
>>> outp = pickle.load(f)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
_pickle.UnpicklingError: invalid load key, 'H'.
```
The 34991 is a magic number for the (default) protocol 3. Other protocols are affected too, just with different list sizes. The error is not consistent: sometimes it is an `EOFError`, and sometimes no exception is raised at all but the unpickled value doesn't match the original one.
Now, the script that finds this case runs for several minutes, trying all list sizes up to 100K. That doesn't feel like a good fit for the otherwise quite fast test suite. Hardcoding this particular instance in a test doesn't seem to be very useful either.
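A sketch of that brute-force search, here with the stdlib `pickle` as a stand-in (it is not affected, so this loop finds nothing; the original run used `from zodbpickle import pickle` and a 100K limit, and hit size 34991). The function names are made up:

```python
import io
import pickle  # the actual search used `from zodbpickle import pickle`


def roundtrips(size, protocol=3):
    """Dump and reload a list of `size` one-char strings; True if it survives."""
    inp = ["a"] * size
    f = io.BytesIO()
    pickle.dump(inp, f, protocol=protocol)
    f.seek(0)
    try:
        return pickle.load(f) == inp
    except Exception:
        return False


def find_first_failure(limit):
    """Return the first list size below `limit` that fails to round-trip."""
    for size in range(limit):
        if not roundtrips(size):
            return size
    return None


# With the unaffected stdlib pickle this prints None
print(find_first_failure(1000))
```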
What do you think, what would be the sanest way to release this?
> Hardcoding this particular instance in a test doesn't seem to be very useful either.
I think it's fine -- and much better than not having any test case at all!
I've added the test and a changelog entry.
Released as 1.0.4 to PyPI, see https://pypi.org/project/zodbpickle/1.0.4/
This is an attempt to fix a data corruption issue that manifests as a random byte replacing a correct one in a pickle, which makes the pickle unreadable.
I couldn't figure out the exact reason for it, but `max_output_len` in the fixed expression might become 0 under extreme conditions (when `output_len` is 0 and `n` is 1). The patch fixed the issue and has been tested by our team for about a month.