zrax / pycdc

C++ python bytecode disassembler and decompiler
GNU General Public License v3.0

Unsupported opcode: <INVALID> (bytecode=A6h) at position 36. #514

Open Niocas opened 2 weeks ago

Niocas commented 2 weeks ago

Unsupported opcode: <INVALID> (bytecode=A6h) at position 36.

I am trying to decompile a Python 3.12 .pyc file, but it fails for nearly all files at bytecode "A6h". How can I fix that? I wrote a Python script with Python 3.12 that imports opcode and prints all opcodes, but it seems that the list is not complete. What am I missing here, and how can I fix the decompiling process?

user@Windows-11-Pro:/mnt/c/Users/user/OneDrive/Desktop/oh-data/pycdc$ pycdc item_data_2.pyc
# Source Generated with Decompyle++
# File: item_data_2.pyc (Python 3.12)

Unsupported opcode: <INVALID> (bytecode=A6h) at position 36.
import bindict
# WARNING: Decompyle incomplete
Niocas commented 2 weeks ago

item_data_pyc.zip

Here is one of the files I am trying to decompile.

Niocas commented 2 weeks ago

image

I added it to python_3_12.cpp now, but the output looks like this when executing "pycdc item_data_2.pyc". Any ideas?

Niocas commented 2 weeks ago

pycdas item_data_2.pyc outputs the following: image

greenozon commented 2 weeks ago

there is opcode 166 in your pyc. it is not a legal one, according to cpython Include/opcode.h (Python 3.12):

image

wilson0x4d commented 2 weeks ago

the direct answer here is: fixing <INVALID> from pycdc requires implementing a decompilation strategy in ASTree.cpp for the specific opcode/instruction, which is non-trivial. you can add the opcode to the case statement in ASTree.cpp just to get the tool to be quiet but it often results in incorrect/incomplete python output.

examples of opcodes blocking successful python code generation (from "OH" pycs) include:

Since getting even an ultra-trivial PR merged proved impossible (#511, which was nothing more than "testing the waters"), I forked and stopped trying to work with pycdc devs. judging by the title, the message you are seeing comes from that fork. that means you're also going to battle the fact that the original repo doesn't have complete opcode maps (pycdas doesn't produce 100% correct results for 3.11 or 3.12), and you may be asking devs to implement/investigate something they don't support yet in the main repo.

for example, according to pycdc main repo "166" is not a valid opcode, but we can see that it is "UNPACK_SEQUENCE_TUPLE" from cpython source code.

#define UNPACK_SEQUENCE_TUPLE                  166
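On the original question of why a script that prints the opcode module's table does not show 0xA6: a minimal sketch, assuming (as appears to be the case in CPython 3.11 and 3.12) that the public opcode.opmap only lists the regular instructions and leaves the specialized ones out. The helper name describe is hypothetical; it only uses the public opcode.opname and opcode.opmap attributes, and unassigned slots in opname show up as placeholders such as "<166>".

```python
import opcode


def describe(op):
    """Return (name, in_public_opmap) for a numeric opcode.

    opcode.opname holds a placeholder like '<166>' for slots the public
    table does not define; very large values are handled the same way.
    """
    name = opcode.opname[op] if op < len(opcode.opname) else f"<{op}>"
    return name, name in opcode.opmap


if __name__ == "__main__":
    # 0xA6 == 166, the byte pycdc complains about in this issue.
    # The result depends on the Python version running the script.
    print(describe(0xA6))
```

If this reports a placeholder and False for 166 on the interpreter that produced the .pyc, that would be consistent with the opcode being absent from the public table, which is why printing opcode.opmap alone looks incomplete.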

you can see the response from @greenozon illustrating the problem you are going to face here.

i'm trying to be kind about this problem. the fact is we have binaries in the wild which contain opcodes which the pycdc project denies exist.

as for the code you're reversing: in most cases the modules containing bindict have no useful code. they contain a bindict and a call out to a native bindict module that i've not been able to locate (possibly it is packed inside the 50MB main exe, since it doesn't exist anywhere in the pyc's). the bindict format is essentially a table similar to NXFNs along with a trailing binary blob (which is not consistent between bindicts, which means it must be contextual). to illustrate what i mean, consider this pycdas result from another bindict file:

        0       RESUME                          0
        2       LOAD_CONST                      0: 0
        4       LOAD_CONST                      1: None
        6       IMPORT_NAME                     0: bindict
        8       STORE_NAME                      0: bindict
        10      PUSH_NULL                       
        12      LOAD_NAME                       0: bindict
        14      LOAD_ATTR                       0: bindict
        34      LOAD_CONST                      2: b'\x01\x00\x00\x00\x00\x00\x00\x00\x13\x00\x00\x00abnormal_item_state\x0c\x00\x00\x00\x00\x01\x00\x00\x01\x96\x05\x02v\x01\x0b\x01\x0f\x17\xfd8\x18\x00\x00\x00\x89\xc0\x95\x12\t\x00'
        36      UNPACK_SEQUENCE_TUPLE           1
        40      CALL                            1
        50      STORE_NAME                      1: data
        52      LOAD_CONST                      1: None
        54      RETURN_VALUE                    

you can see this is basically just calling bindict.bindict(...) passing in the constant bytes/string shown in the disasm. this is basically the same in all files containing bindict data.
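To make the "table plus trailing blob" description a bit more concrete, here is a rough sketch that splits the leading name table off such a constant. The layout is an assumption inferred only from the two byte strings quoted in this thread, not from any bindict documentation: a little-endian uint32 count, then count + 1 uint32 offsets, then the concatenated ASCII names. The function name split_bindict_blob is hypothetical, and the trailing blob is returned untouched because its meaning is unknown.

```python
import struct


def split_bindict_blob(blob: bytes):
    """Split an (assumed) bindict constant into names and the trailing blob.

    Assumed layout, guessed from the samples in this thread:
      uint32 count | (count + 1) * uint32 offsets | names | trailing bytes
    """
    (count,) = struct.unpack_from("<I", blob, 0)
    offsets = struct.unpack_from(f"<{count + 1}I", blob, 4)
    base = 4 + 4 * (count + 1)  # first byte after the offset table
    names = [
        blob[base + start:base + end].decode("ascii")
        for start, end in zip(offsets, offsets[1:])
    ]
    trailing = blob[base + offsets[-1]:]
    return names, trailing


if __name__ == "__main__":
    # The constant from the disassembly above.
    sample = (
        b'\x01\x00\x00\x00\x00\x00\x00\x00\x13\x00\x00\x00abnormal_item_state'
        b'\x0c\x00\x00\x00\x00\x01\x00\x00\x01\x96\x05\x02v\x01\x0b\x01\x0f\x17\xfd8'
        b'\x18\x00\x00\x00\x89\xc0\x95\x12\t\x00'
    )
    names, trailing = split_bindict_blob(sample)
    print(names)  # ['abnormal_item_state']
    print(trailing.hex())
```

This only recovers the key names; decoding the values would still require the native bindict module or more samples to compare.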

the approximate py output from pycdc (if it were actually implemented rather than being denied) would look something like this:

# WIP opcode: UNPACK_SEQUENCE_TUPLE (bytecode=A6h) at position 36.
# Source Generated with Decompyle++
# File: abnormal_capture_rate_data.do.pyc (Python 3.12)

import bindict
data = bindict.bindict(b'\x07\x00\x00\x00\x00\x00\x00\x00\x10\x00\x00\x00 \x00\x00\x000\x00\x00\x00?\x00\x00\x00O\x00\x00\x00_\x00\x00\x00h\x00\x00\x00settlement_rate2settlement_rate4settlement_rate3max_capture_nummust_succeed_numsettlement_rate1init_rateG\x01\x00\x00\x00\x00\x02\x02\x01\x01\x06\x03\x04\x05\n\x06\x00\x12\x00"\x02\x12\x02"\x01\x12\x01"\x06"\x03\x01\x04\x01\x05"\x96\x0e*\xfc\xa9\xf1\xd2Mb`?\xfa~j\xbct\x93h?{\x14\xaeG\xe1zt?\xfc\xa9\xf1\xd2MbP?\x04c\xfc\xa9\xf1\xd2MbP?\x96\x0e\x15\x00\x00\x80?\x00\x00\x00\x00\x00\x00\x00\x00ffffff\xe6?\x02\x02\x9a\x99\x99\x99\x99\x99\xe9?\x96\x0e*{\x14\xaeG\xe1z\x84?\xb8\x1e\x85\xebQ\xb8\x8e?\x9a\x99\x99\x99\x99\x99\x99?\xfa~j\xbct\x93h?\x04c{\x14\xaeG\xe1zt?\x96\x0e\x1a\x9a\x99\x99\x99\x99\x99\xe9?\xcd\xcc\xcc\xcc\xcc\xcc\xec?\x00\x00\x00\x00\x9a\x99\x99\x99\x99\x99\xd9?\x03c333333\xe3?\x96\x0e\x1a\x9a\x99\x99\x99\x99\x99\xa9?333333\xb3?\x00\x00\x00>{\x14\xaeG\xe1zt?\x04c\x9a\x99\x99\x99\x99\x99\x99?\x96\x0e*333333\xe3?\x9a\x99\x99\x99\x99\x99\xe9?\xcd\xcc\xcc\xcc\xcc\xcc\xec?\x9a\x99\x99\x99\x99\x99\xc9?\x04c\x9a\x99\x99\x99\x99\x99\xd9?\x96\x0e\x1a\x9a\x99\x99\x99\x99\x99\xc9?333333\xd3?\x00\x00\x00?\x9a\x99\x99\x99\x99\x99\xa9?\x04c\x9a\x99\x99\x99\x99\x99\xb9?f\x0b\x07\x00\x00\x00\x00\x93\x01\x00\x00\x1bc\r4\x97\x01\x00\x006\xc6\x1ah\x8f\x01\x00\x00\xc99\xe5\x97\x85\x01\x00\x00R)(\x9c\x88\x01\x00\x00\xe4\x9c\xf2\xcb\x8b\x01\x00\x00m\x8c5\xd0\x82\x01\x00\x00\x11\x07$\x01\x02Q\x11\x05r\x01\x01\x9f\x01\x11\x03\xc8\x01\x01\x00\xf1\x01\x11\x01\x9e\x02\x00')

anyway, the short answer is resolving the issue requires updating ASTree.cpp (after fixing the incomplete opcode maps.)

wilson0x4d commented 2 weeks ago

@greenozon you might find this of interest:

https://github.com/wilson0x4d/pycdc/blob/wip/bytes/python_3_11.cpp

https://github.com/wilson0x4d/pycdc/blob/wip/bytes/python_3_12.cpp

i see no reason not to have entries for every opcode that appears in official cpython. leaving them out actually works against pycdc maintainers and end-users, who have to figure out what to keep and what to remove, and it causes no harm to have entries that cpython's compile(...) would not produce. the mere fact that an opcode is represented in cpython source code at any point during the lifetime of a given version/branch is sufficient reason to include it (IMHO)

wilson0x4d commented 2 weeks ago

i also have ASTree implementation code for half a dozen ops that is not pushed to my wip branch. i would love to work with people who understand the ast stack and frame logic better than i do.