zrax / pycdc

C++ python bytecode disassembler and decompiler
GNU General Public License v3.0
3.02k stars 593 forks source link

Support Python 3.8 decompilation #449

Open zrax opened 4 months ago

zrax commented 4 months ago

Tasks

TiZCrocodile commented 3 months ago

@zrax do you have an idea how to support the new while loops? because I don't see a way to infer a comparison of a while, just until the JUMP_ABSOLUTE or JUMP_BACKWARD or JUMP_BACKWARD_IF_TRUE etc. which is already the end of the while loop.

so we would need to maybe switch the if block inferred before to a while? or maybe go over all the opcodes before we process them, as a pre step to search for a while loops? for example this source code:

item = None
while item is None:
    print('inside while')

has this bytecode in 3.7:

  2           0 LOAD_CONST               0 (None)
              2 STORE_FAST               0 (item)

  3           4 SETUP_LOOP              20 (to 26)
        >>    6 LOAD_FAST                0 (item)
              8 LOAD_CONST               0 (None)
             10 COMPARE_OP               8 (is)
             12 POP_JUMP_IF_FALSE       24

  4          14 LOAD_GLOBAL              0 (print)
             16 LOAD_CONST               1 ('inside while')
             18 CALL_FUNCTION            1
             20 POP_TOP
             22 JUMP_ABSOLUTE            6
        >>   24 POP_BLOCK
        >>   26 LOAD_CONST               0 (None)
             28 RETURN_VALUE

whereas in 3.8, it's exactly the same only there is no SETUP_LOOP and POP_BLOCK, so the only thing here that differentiate it from a regular if statement, is the JUMP_ABSOLUTE or any other opcode that goes backwards like this. what way do you think we can solve this maybe?

Levak commented 2 weeks ago

As a reference:

493 : BEGIN_FINALLY

BEGIN_FINALLY is a simple LOAD_CONST NONE

Levak commented 2 weeks ago

whereas in 3.8, it's exactly the same only there is no SETUP_LOOP and POP_BLOCK, so the only thing here that differentiate it from a regular if statement, is the JUMP_ABSOLUTE or any other opcode that goes backwards like this. what way do you think we can solve this maybe?

@TiZCrocodile I just noticed your message about loops in Python 3.8 now, after scratching my head on it for hours today :D

Is it always true, that a JUMP_ABOSLUTE "backward" can only be a while loop? I haven't eaten enough bytecode to tell. The trick would be to infer a while loop when parsing the JUMP_ABSOLUTE then. Easier said than done, I know.

If condition

$ echo -e '
if i < 10:
   print(i)
' | python3.9 -m dis -
  2           0 LOAD_NAME                0 (i)
              2 LOAD_CONST               0 (10)
              4 COMPARE_OP               0 (<)
              6 POP_JUMP_IF_FALSE       16

  3           8 LOAD_NAME                1 (print)
             10 LOAD_NAME                0 (i)
             12 CALL_FUNCTION            1
             14 POP_TOP
        >>   16 LOAD_CONST               1 (None)
             18 RETURN_VALUE

While loop

$ echo -e '
while i < 10:
   print(i)
' | python3.9 -m dis -
  2     >>    0 LOAD_NAME                0 (i)
              2 LOAD_CONST               0 (10)
              4 COMPARE_OP               0 (<)
              6 POP_JUMP_IF_FALSE       18

  3           8 LOAD_NAME                1 (print)
             10 LOAD_NAME                0 (i)
             12 CALL_FUNCTION            1
             14 POP_TOP
             16 JUMP_ABSOLUTE            0
        >>   18 LOAD_CONST               1 (None)
             20 RETURN_VALUE

EDIT1: something like this maybe:

diff --git a/ASTNode.h b/ASTNode.h
index 22b90e1..d8423af 100644
--- a/ASTNode.h
+++ b/ASTNode.h
@@ -555,8 +555,10 @@ public:

     void setEnd(int end) { m_end = end; }

-private:
+protected:
     BlkType m_blktype;
+
+private:
     int m_end;
     list_t m_nodes;

@@ -575,6 +577,7 @@ public:
                  bool negative = false)
         : ASTBlock(blktype, end), m_cond(std::move(cond)), m_negative(negative) { }

+    void inferToWhileLoop() { m_blktype = ASTBlock::BLK_WHILE; }
     PycRef<ASTNode> cond() const { return m_cond; }
     bool negative() const { return m_negative; }

diff --git a/ASTree.cpp b/ASTree.cpp
index fceef40..26b6593 100644
--- a/ASTree.cpp
+++ b/ASTree.cpp
@@ -1226,6 +1246,16 @@ PycRef<ASTNode> BuildFromCode(PycRef<PycCode> code, PycModule* mod)
                                 curblock->append(tmp.cast<ASTNode>());
                             }
                         }
+                    } else if (curblock->blktype() == ASTBlock::BLK_IF) {
+                        /* in Python 3.8, SETUP_LOOP was removed for
+                           while loop too. We infer IF conditions to
+                           WHILE blocks when we meet a JUMP_ABSOLUTE */
+                        if (mod->verCompare(3, 8) >= 0)
+                        {
+                            curblock.cast<ASTCondBlock>()->inferToWhileLoop();
+                            stack = stack_hist.top();
+                            stack_hist.pop();
+
+                            PycRef<ASTBlock> tmp = curblock;
+                            blocks.pop();
+                            curblock = blocks.top();
+                            curblock->append(tmp.cast<ASTNode>());
+                        }
                     } else if (curblock->blktype() == ASTBlock::BLK_ELSE) {
                         stack = stack_hist.top();
                         stack_hist.pop();

EDIT2: There is a problem with the continue keyword. It produces a JUMP_BACKWARD to the start of the loop too. But hey, right now, it's not supported even in for-loops !

TiZCrocodile commented 2 weeks ago

@Levak oh nice, Im also glad another one is trying to contribute to this project :D and there is JUMP_BACKWARD in (list / set / dict) comprehensions also.

Levak commented 2 weeks ago

oh nice, Im also glad another one is trying to contribute to this project :D

Well, I took it as a dare because of what I'm trying to decompile was Python 3.9 and I was like "is there really no one else on the planet able to decompile Python 3.8+?"

and there is JUMP_BACKWARD in (list / set / dict) comprehensions also.

We need so much more tests - I just noticed that most of the compiled examples are for Python 2.x and were never recompiled for Python 3.x - So much features were added over time but not added to the test base?

TiZCrocodile commented 2 weeks ago

@Levak

Well, I took it as a dare because of what I'm trying to decompile was Python 3.9 and I was like "is there really no one else on the planet able to decompile Python 3.8+?"

thats exactly why i started it too xD, i wanted to prove my friend that pyc on python 3.11 is not obfuscated and i can get the code, but im now checking here and there.

We need so much more tests - I just noticed that most of the compiled examples are for Python 2.x and were never recompiled for Python 3.x - So much features were added over time but not added to the test base?

yea i agree with you, there is also wrong decompilation of some IFs conditions, we need to do more tests.

Levak commented 2 weeks ago

continue and break keywords are not supported by pycdc in recent Pythons. It seems that there was a keyword for both in the past which was removed in favor of JUMP_ABSOLUTE_A.

So far, I have identified these cases for JUMP_ABSOLUTE_A:

When I try to add while-loop inference from if blocks, everything becomes a PITA. The more I find new test cases, the more I'm sinking in quicksand.

My best example so far is this one:

while i < 10:                   while i < 10:                   
    i += 1                          i += 1                      
    while j < 10:        <|>        if j < 10:                  
        j += 1                          j += 1                  
        if i == j:                      if i == j:              
            continue                        continue            

Bytecode (Python 3.8) - condensed just to illustrate the relevant blocks

72      <...COMPARE_OP>                 0 (i<10)         72      <...COMPARE_OP>                 0 (i<10)    
78      POP_JUMP_IF_FALSE               118      <|>     78      POP_JUMP_IF_FALSE               116      
80      <...INPLACE_ADD>                                 80      <...INPLACE_ADD>                         
88      <...COMPARE_OP>                 0 (j<10)         88      <...COMPARE_OP>                 0 (j<10)    
94      POP_JUMP_IF_FALSE               72               94      POP_JUMP_IF_FALSE               72       
96      <...INPLACE_ADD>                                 96      <...INPLACE_ADD>                         
104     <...COMPARE_OP>                 2 (i==j)         104     <...COMPARE_OP>                 2 (i==j)   
110     POP_JUMP_IF_FALSE               88       <|>     110     POP_JUMP_IF_FALSE               72       
112     JUMP_ABSOLUTE                   88       <|>     112     JUMP_ABSOLUTE                   72       
114     JUMP_ABSOLUTE                   88       <|>     114     JUMP_ABSOLUTE                   72       
116     JUMP_ABSOLUTE                   72       <|

Here we can see that, respectively, from left to right, the last 2 and last JUMP_ABSOLUTE are never executed, but help highlight the while-loops. Problem is, when a POP_JUMP_IF_FALSE points to a JUMP_ABSOLUTE, Python decides to optimize it to a single POP_JUMP_IF_FALSE, which ruins block tree logic. If it was obvious that a POP_JUMP_IF_FALSE pointing to an opcode that follows a JUMP_ABSOLUTE is an obvious while-loop, a POP_JUMP_IF_FALSE that goes backward can either be a if or a while loop, inside a while loop.

I'm really starting to consider giving up Y_Y

EDIT: Discussions regarding while/continue linked with #467

Levak commented 2 weeks ago

I realize now why it is so hard to infer while loops in Python 3.8 and 3.9. The jumps are optimized by the Python compiler. For instance, POP_JUMP_IF_FALSE X where at position X there is a JUMP_ABSOLUTE Y will result in a POP_JUMP_IF_FALSE Y. This leads to dead code, as the optimized JUMP_ABSOLUTE Y will never be reached.

Here is an example: Initial code => equivalent code => disassembled bytecode => some comments The X between bytecode position and opcode means "dead code".

                                                     0       LOAD_CONST                      1: 0
                                                     2       STORE_FAST                      0: i
                                                     4       LOAD_CONST                      1: 0
                                                     6       STORE_FAST                      1: j
                                                     8       LOAD_FAST                       0: i
                                                     10      LOAD_CONST                      2: 10
                                                     12      COMPARE_OP                      0 (<)
                                                     14      POP_JUMP_IF_FALSE               64
                                                     16      LOAD_FAST                       0: i
                                                     18      LOAD_CONST                      3: 1
                                                     20      INPLACE_ADD                     
                                                     22      STORE_FAST                      0: i
                                                     24      LOAD_FAST                       1: j
                                                     26      LOAD_CONST                      4: 20
                                                     28      COMPARE_OP                      0 (<)
                                                     30      POP_JUMP_IF_FALSE               64 => 60, exit loop 2
                                                     32      LOAD_FAST                       1: j
                                                     34      LOAD_CONST                      3: 1
                                                     36      INPLACE_ADD                     
                                                     38      STORE_FAST                      1: j
                                                     40      LOAD_FAST                       0: i
while i < 10:           while i < 10:                42      LOAD_FAST                       1: j
  i += 1                  i += 1                     44      BINARY_ADD                      
  while j < 20:           while j < 20:              46      LOAD_CONST                      5: 30
    j += 1                  j += 1                   48      COMPARE_OP                      4 (>)
    if i + j > 30:          if i + j > 30:           50      POP_JUMP_IF_FALSE               24  => 56 = else
      break                   break (A)              52      JUMP_ABSOLUTE                   64  => 60 break loop 2
    else:                   else:                    54  X   JUMP_ABSOLUTE                   24  => 56 = 58 = exit else
      continue                continue (B)           56  X   JUMP_ABSOLUTE                   24  => 58 = continue inside loop 2
                            continue (C)             58  X   JUMP_ABSOLUTE                   24  => loop 2 end
  break                   break (D)                  60  X   JUMP_ABSOLUTE                   64  => break loop 1
                          continue (E)               62  X   JUMP_ABSOLUTE                   8   => loop 1 end
                                                     64      LOAD_FAST                       0: i
                                                     66      LOAD_FAST                       1: j
                                                     68      BINARY_ADD                      
                                                     70      RETURN_VALUE 

As you can see, every single keyword can be found back. This is only true up to Python 3.9 included. In Python 3.10 they refined this control flow which makes this observation null.

So the first step at inferring loops from ifs, is to probably resolve the dead links.