WIP: fork py38 / py312 pickle C extensions

tseaver commented 4 months ago

New features:

Support for new / faster protocol versions (4/5).
The 3.12 version should be near-ready to enable sub-interpreter support.

Remaining issues:

[ ] Decide whether to remove _pickle_33.c, and build only one extension module (much simpler). If so, users who require the older "preserve pickle-level compatibility for databases created under Python 2" scenario will need to pin to the last major version, forever.
[ ] Update setup.py to build the desired extension, based on Python version.
[ ] Once we've decided that, fork the 'pickletools' modules from the Python stdlib as well, and arrange for them to be imported based on Python version.
[ ] And update the tests to use the new 'pickletools', match the new protocol versions, etc.

icemac commented 4 months ago

Decide whether to remove _pickle_33.c, and build only one extension module (much simpler). If so, users who require the older "preserve pickle-level compatibility for databases created under Python 2" scenario will need to pin to the last major version, forever.

Databases created under Python 2 should have been migrated to Python 3 using https://pypi.org/project/zodbupdate/ – at least this is what we have promoted to do to upgrade existing applications to Python 3. It is not even possible to open a not converted database on Python 3 – it gets rejected because of a different magic code in the first bytes of the database file. From my understanding there should be no need to keep that Python 2 compat scenario – but maybe I am misunderstanding something here.

navytux commented 4 months ago

Hello everyone,

For the reference: in the context of ERP5 we do load pickles generated on py2 under py3 instead of migrating the database. Quoting https://lab.nexedi.com/nexedi/pygolang/-/merge_requests/21#note_205356 :

Please see pickle_py2_gpy3_demo.py and ZODB_py2_gpy3_demo.py there for explanation + demonstration of how py2/py3 pickle compatibility problem is solved.

...

# However when run under gpy3, the string is loaded ok as bstr. Since bstr has the
# same semantic as regular str on py2, working with that object produces the
# same result plain py2 would produce when adjusting the data. And then, bstr
# is also persisted ok and via the same *STRING opcodes, that py2 originally
# used for the data.
#
# This way both py2 and gpy3 can interoperate on the same database: py2 can
# produce data, gpy3 can read the data and modify it, and further py2 can load
# updated data, again, just ok.

From this point of view it would be handy to retain support for STRING opcodes.

Thanks beforehand, Kirill

P.S. Please see https://lab.nexedi.com/nexedi/pygolang/-/merge_requests/21 for the whole context of this if you are interested.

tseaver commented 4 months ago

@icemac

Databases created under Python 2 should have been migrated to Python 3 using https://pypi.org/project/zodbupdate/ – at least this is what we have promoted to do to upgrade existing applications to Python 3.

Interesting: I'd forgotten that tool existed.

It is not even possible to open a not converted database on Python 3 – it gets rejected because of a different magic code in the first bytes of the database file. From my understanding there should be no need to keep that Python 2 compat scenario – but maybe I am misunderstanding something here.

I'd be glad to just git rm src/zodbpickle/_pickle_33.c if so, and just pick the 38 or 312 versions in setup.py based on the Python version.

tseaver commented 4 months ago

@navytux

From this point of view it would be handy to retain support for STRING opcodes.

I just looked, and the _pickle_38.c and _pickle_312.c modules in this draft preserve the same semantics for reading the STRING and opcode as the current_pickle_33.c`:

First, they strip the required outer quotes abd convert the remainder to bytes (using PyBytes_DecodeEscape)
If self->encoding is set to bytes (that is the Py2 compatibility option), they push the bytes object on the stack.
Otherwise, they covert to str (using PyUnicode_FromEncodedObject) and push that object.

For writing, the new modules lose the ability to emit the STRING opcode:

for protocol < 3, they fall back to the old "fake reduce" strategy, which should be consumable from Python2.

Examining the _pickle_33.c version, I don't believe that this branch emitting STRING opcode is actually reachable: if (!self->bin) implies self->protocol == 0, but the prior section is guarded by if (self->protocol < 3).

navytux commented 4 months ago

@tseaver, thanks for feedback.

Maybe when originally replying I misunderstood a bit the following:

Decide whether to remove _pickle_33.c, and build only one extension module (much simpler). If so, users who require the older "preserve pickle-level compatibility for databases created under Python 2" scenario will need to pin to the last major version, forever.

I understand and agree that all _pickle_33, _pickle_38 and _pickle_312 handle *STRING opcodes on loading essentially the same way: out of the box the bytestring is decoded to unicode string, or is left as bytes if Unpickler is created with encoding='bytes'. From this point of view it should be indeed ok to drop _pickle33 and leave only the rest. But then could you please clarify what did you mean with "If so, users who require the older "preserve pickle-level compatibility for databases created under Python 2" scenario will need to pin to the last major version, forever."_ ?

For the reference I remembered that my bstr already works ok with respect to pickle loading and saving on stdlib pickle from py3.11, so dropping py3.3-based cpickle should be likely ok.

Your outline of how loading works and how saving does not work for STRING opcodes is correct, but it applies to unmodified stdlib pickle and zodbpickle behaviours.

However on my side loading and saving *STRING work ok out of the box, and in completely backward compatible way with respect to py2, because I patch pickle modules at runtime - both py and C versions - to handle that well. For example the loading is implemented via using Unpickler(encoding='bstr') and returning bstr(byte-data) for *STRING opcodes:

https://lab.nexedi.com/kirr/pygolang/-/blob/58f86b12/golang/_golang_str_pickle.pyx#L374-393 https://lab.nexedi.com/kirr/pygolang/-/blob/58f86b12/golang/_golang_str_pickle.pyx#L475-497 https://lab.nexedi.com/kirr/pygolang/-/blob/58f86b12/golang/golang_str_pickle_test.py#L73-127

while for saving the code is adjusted to emit bstr objects back via those same *STRING instructions:

https://lab.nexedi.com/kirr/pygolang/-/blob/58f86b12/golang/_golang_str_pickle.pyx#L395-423 https://lab.nexedi.com/kirr/pygolang/-/blob/58f86b12/golang/_golang_str_pickle.pyx#L635-673 https://lab.nexedi.com/kirr/pygolang/-/blob/58f86b12/golang/golang_str_pickle_test.py#L129-185

This way what py2 saved as its str is loaded on py3 as bstr object which has semantic compatible to bytestring from py2.

In total there are 3 types in this scheme:

1) raw bytes (represented by bytes on py3 and zodbpickle.binary on py2), 2) binary string (represented by bstr on py3 and str on py2), and 3) unicode string (represented by str on py3 and unicode on py2)

The binary string and unicode string are just two different representations of the same "string" concept. So being two different representations of the same entity, contrary to "raw bytes", they are interoperable with each other without explicit conversion: for example bstr + unicode works seamlessly out of the box as well as all other string operations do automatic coercion as needed.

And for the reference: even though intended encoding for binary string representation is UTF-8, it still does work ok even when binary data is invalid UTF-8.

My primary message here is that there still are users that care about backward compatibility for py2 pickles, and there is a way to keep that compatibility working completely and transparently without database migration.

Kirill

zopefoundation / zodbpickle

WIP: fork py38 / py312 pickle C extensions #92