kedder closed this issue 5 years ago
I'll try to repro it in a test, but it is pretty difficult: the bug was very hard to trigger, and I suspect it happens only on a very specific set of data.
I noticed that changing the `MAX_WRITE_BUF_SIZE` constant also affects the issue. From what I can tell, the C code tries to keep the buffer size below `MAX_WRITE_BUF_SIZE`, and if the buffer exceeds it, it clears the buffer and starts growing it again. This constant and its handling were removed in later versions (see https://github.com/python/cpython/blob/master/Modules/_pickle.c), so I suspect this bug does not apply to future versions of Python.
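As a rough illustration of that flush-and-reset behavior, here is a sketch based only on the description above, not on the actual `_pickle_33.c` code — the `FlushingWriter` class and the threshold value are made up for illustration; only the `MAX_WRITE_BUF_SIZE` name comes from the source:

```python
import io

MAX_WRITE_BUF_SIZE = 64 * 1024  # illustrative threshold, not the real C value


class FlushingWriter:
    """Hypothetical writer that keeps its buffer below MAX_WRITE_BUF_SIZE."""

    def __init__(self, fileobj):
        self.fileobj = fileobj
        self.buf = bytearray()

    def write(self, data):
        self.buf += data
        # Once the buffer exceeds the limit, flush it out and start over
        if len(self.buf) > MAX_WRITE_BUF_SIZE:
            self.fileobj.write(bytes(self.buf))
            self.buf.clear()

    def flush(self):
        self.fileobj.write(bytes(self.buf))
        self.buf.clear()
```

The point is only that the buffer is periodically cleared and re-grown, which is where a growth-arithmetic bug would get a chance to bite repeatedly.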
We tried to port _pickle_33.c from the newer upstream versions, but it has received too many changes for that to be feasible.
Unfortunately I lost the original traceback that we were seeing. The Python version of the pickle code (which is used if you delete `_pickle.cpython-37m-x86_64-linux-gnu.so`) is not affected by this bug, so that's a temporary workaround.
Other changes that seem to fix the issue (at least for the one particular data set I had at hand):

- Setting `WRITE_BUF_SIZE` to be equal to `MAX_WRITE_BUF_SIZE`, so the buffer doesn't grow.
- Changing the `(self->output_len + n) / 2 * 3` expression to `(self->output_len + n) * 2`, to avoid the corner case where the buffer size does not grow when `self->output_len + n` is equal to 1.

Another detail: this bug produces an error on reading pickles, i.e. this change doesn't fix corrupted pickles that are already stored in ZODB.
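The corner case in the growth expression is plain integer arithmetic; C's integer division truncates just like Python's `//`, so it can be checked directly (a minimal sketch with the C expressions transliterated to Python; the function names are made up):

```python
def buggy_new_size(output_len, n):
    # Mirrors the C expression (self->output_len + n) / 2 * 3;
    # C integer division truncates, like Python's //
    return (output_len + n) // 2 * 3


def fixed_new_size(output_len, n):
    # Mirrors the proposed replacement (self->output_len + n) * 2
    return (output_len + n) * 2


print(buggy_new_size(0, 1))  # 0 -- the buffer size never grows
print(fixed_new_size(0, 1))  # 2
```

With `output_len + n == 1`, the division runs first and truncates to 0, so the multiplication by 3 still yields 0 and the buffer cannot grow; the `* 2` form always produces a strictly larger size.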
I was able to fix one corrupted pickle by carefully inspecting its contents with `pickletools` and fixing one byte in a hex editor.
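For anyone wanting to inspect a pickle the same way: the standard `pickletools` module disassembles a pickle stream into named opcodes with their byte offsets, which is what makes locating the bad byte for a hex editor practical. A minimal sketch using the stdlib `pickle` (the corrupted pickles here of course came from zodbpickle):

```python
import io
import pickle
import pickletools

data = pickle.dumps(["a", "b"], protocol=3)

# Human-readable disassembly: one line per opcode, with its byte offset
out = io.StringIO()
pickletools.dis(data, out=out)
print(out.getvalue())

# genops() yields (opcode, argument, byte position) tuples for
# programmatic checks, e.g. to locate a suspicious opcode's offset
ops = [(op.name, pos) for op, arg, pos in pickletools.genops(data)]
```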
This is great!
At ZODB/issues/271 I reported a problem: after a `zodbupdate` run, during the `--pack` phase, we had reproducible explosions under Python 3.7 that didn't happen on Python 3.6.
This patch seems to fix that, i.e. I can no longer reproduce the explosion. The error I saw also looks quite similar: a single byte was written wrongly somewhere in the middle of a pickle (usually an append instruction was changed to a setitem instruction), which I was likewise able to fix by manually editing the pickle in question. (In my case, though, pickles later in the ZODB had more problems that I wasn't able to work around.)
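A single wrong opcode byte really is enough to make a pickle unreadable. A minimal sketch with the stdlib `pickle` — the byte chosen here (turning a `BINUNICODE` opcode `'X'` into `'Y'`, which is not a valid pickle opcode) is just an illustration, not the actual corruption pattern observed:

```python
import pickle

good = pickle.dumps(["a"] * 3, protocol=3)

# Locate the first BINUNICODE opcode ('X') and overwrite it with 'Y',
# which is not assigned to any opcode in any pickle protocol
pos = good.index(b"X")
bad = bytearray(good)
bad[pos] = ord("Y")

try:
    pickle.loads(bytes(bad))
except pickle.UnpicklingError as exc:
    print(exc)  # e.g. "invalid load key, 'Y'."
```

Fixing such a pickle by hand is then the reverse operation: find the offset of the invalid opcode and restore the original byte.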
Sadly, the only way I can reliably reproduce it is `zodbupdate`-ing a 4.5 GB ZODB from Python 2 to 3, and I have no clue how to even begin to make a reproducible test case from that. :-( The DB is full of data I cannot share, so I cannot give it to somebody with more experience minimising the problem.
Still, I hope this helps support the fix by providing at least some evidence that it does something sensible.
If it helps, I can share a pickle object with the corruption privately to people I trust.
If it is too hard to write a test, at least a change log entry would be welcome.
Funny thing, using some brute force I was able to repro the issue and confirm this PR actually fixes it:
```python
>>> import io
>>> from zodbpickle import pickle
>>> inp = ['a'] * 34991
>>> f = io.BytesIO()
>>> pickle.dump(inp, f)
>>> len(f.getvalue())
70064
>>> f.seek(0)
0
>>> outp = pickle.load(f)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
_pickle.UnpicklingError: invalid load key, 'H'.
```
The 34991 is a magic number for the (default) protocol 3. Other protocols are affected too, just with different list sizes. The error is not consistent: sometimes it is an `EOFError`, and sometimes no exception is raised at all but the unpickled value doesn't match the original one.
Now, the script that finds this case runs for several minutes, trying all list sizes up to 100K. That doesn't feel like a good fit for the otherwise quite fast test suite. Hardcoding this particular instance in a test doesn't seem to be very useful either.
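A sketch of that brute-force search, here with the stdlib `pickle` as a stand-in (it is not affected, so this loop finds nothing; the original run used `from zodbpickle import pickle` and a 100K limit, and hit size 34991). The function names are made up:

```python
import io
import pickle  # the actual search used `from zodbpickle import pickle`


def roundtrips(size, protocol=3):
    """Dump and reload a list of `size` one-char strings; True if it survives."""
    inp = ["a"] * size
    f = io.BytesIO()
    pickle.dump(inp, f, protocol=protocol)
    f.seek(0)
    try:
        return pickle.load(f) == inp
    except Exception:
        return False


def find_first_failure(limit):
    """Return the first list size below `limit` that fails to round-trip."""
    for size in range(limit):
        if not roundtrips(size):
            return size
    return None


# With the unaffected stdlib pickle this prints None
print(find_first_failure(1000))
```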
What do you think, what would be the sanest way to release this?
> Hardcoding this particular instance in a test doesn't seem to be very useful either.
I think it's fine -- and much better than not having any test case at all!
I've added the test and a changelog entry.
Released as 1.0.4 to PyPI, see https://pypi.org/project/zodbpickle/1.0.4/
This is an attempt to fix a data corruption issue that manifests as a random byte replacing a correct one in a pickle, which makes the pickle unreadable.
I couldn't figure out the exact reason for it, but `max_output_len` in the fixed expression might become 0 under extreme conditions (when `output_len` is 0 and `n` is 1). The patch fixed the issue and has been tested by our team for about a month.