Open zrax opened 9 months ago
@zrax do you have an idea how to support the new while loops? because I don't see a way to infer a comparison of a while, just until the JUMP_ABSOLUTE or JUMP_BACKWARD or JUMP_BACKWARD_IF_TRUE etc. which is already the end of the while loop.
so we would need to maybe switch the if block inferred before to a while? or maybe go over all the opcodes before we process them, as a pre step to search for a while loops? for example this source code:
item = None
while item is None:
print('inside while')
has this bytecode in 3.7:
2 0 LOAD_CONST 0 (None)
2 STORE_FAST 0 (item)
3 4 SETUP_LOOP 20 (to 26)
>> 6 LOAD_FAST 0 (item)
8 LOAD_CONST 0 (None)
10 COMPARE_OP 8 (is)
12 POP_JUMP_IF_FALSE 24
4 14 LOAD_GLOBAL 0 (print)
16 LOAD_CONST 1 ('inside while')
18 CALL_FUNCTION 1
20 POP_TOP
22 JUMP_ABSOLUTE 6
>> 24 POP_BLOCK
>> 26 LOAD_CONST 0 (None)
28 RETURN_VALUE
whereas in 3.8, it's exactly the same only there is no SETUP_LOOP and POP_BLOCK, so the only thing here that differentiate it from a regular if statement, is the JUMP_ABSOLUTE or any other opcode that goes backwards like this. what way do you think we can solve this maybe?
As a reference:
BEGIN_FINALLY
BEGIN_FINALLY
is a simple LOAD_CONST NONE
whereas in 3.8, it's exactly the same only there is no SETUP_LOOP and POP_BLOCK, so the only thing here that differentiate it from a regular if statement, is the JUMP_ABSOLUTE or any other opcode that goes backwards like this. what way do you think we can solve this maybe?
@TiZCrocodile I just noticed your message about loops in Python 3.8 now, after scratching my head on it for hours today :D
Is it always true, that a JUMP_ABOSLUTE
"backward" can only be a while loop? I haven't eaten enough bytecode to tell. The trick would be to infer a while loop when parsing the JUMP_ABSOLUTE
then. Easier said than done, I know.
$ echo -e '
if i < 10:
print(i)
' | python3.9 -m dis -
2 0 LOAD_NAME 0 (i)
2 LOAD_CONST 0 (10)
4 COMPARE_OP 0 (<)
6 POP_JUMP_IF_FALSE 16
3 8 LOAD_NAME 1 (print)
10 LOAD_NAME 0 (i)
12 CALL_FUNCTION 1
14 POP_TOP
>> 16 LOAD_CONST 1 (None)
18 RETURN_VALUE
$ echo -e '
while i < 10:
print(i)
' | python3.9 -m dis -
2 >> 0 LOAD_NAME 0 (i)
2 LOAD_CONST 0 (10)
4 COMPARE_OP 0 (<)
6 POP_JUMP_IF_FALSE 18
3 8 LOAD_NAME 1 (print)
10 LOAD_NAME 0 (i)
12 CALL_FUNCTION 1
14 POP_TOP
16 JUMP_ABSOLUTE 0
>> 18 LOAD_CONST 1 (None)
20 RETURN_VALUE
EDIT1: something like this maybe:
diff --git a/ASTNode.h b/ASTNode.h
index 22b90e1..d8423af 100644
--- a/ASTNode.h
+++ b/ASTNode.h
@@ -555,8 +555,10 @@ public:
void setEnd(int end) { m_end = end; }
-private:
+protected:
BlkType m_blktype;
+
+private:
int m_end;
list_t m_nodes;
@@ -575,6 +577,7 @@ public:
bool negative = false)
: ASTBlock(blktype, end), m_cond(std::move(cond)), m_negative(negative) { }
+ void inferToWhileLoop() { m_blktype = ASTBlock::BLK_WHILE; }
PycRef<ASTNode> cond() const { return m_cond; }
bool negative() const { return m_negative; }
diff --git a/ASTree.cpp b/ASTree.cpp
index fceef40..26b6593 100644
--- a/ASTree.cpp
+++ b/ASTree.cpp
@@ -1226,6 +1246,16 @@ PycRef<ASTNode> BuildFromCode(PycRef<PycCode> code, PycModule* mod)
curblock->append(tmp.cast<ASTNode>());
}
}
+ } else if (curblock->blktype() == ASTBlock::BLK_IF) {
+ /* in Python 3.8, SETUP_LOOP was removed for
+ while loop too. We infer IF conditions to
+ WHILE blocks when we meet a JUMP_ABSOLUTE */
+ if (mod->verCompare(3, 8) >= 0)
+ {
+ curblock.cast<ASTCondBlock>()->inferToWhileLoop();
+ stack = stack_hist.top();
+ stack_hist.pop();
+
+ PycRef<ASTBlock> tmp = curblock;
+ blocks.pop();
+ curblock = blocks.top();
+ curblock->append(tmp.cast<ASTNode>());
+ }
} else if (curblock->blktype() == ASTBlock::BLK_ELSE) {
stack = stack_hist.top();
stack_hist.pop();
EDIT2: There is a problem with the continue
keyword. It produces a JUMP_BACKWARD
to the start of the loop too. But hey, right now, it's not supported even in for-loops !
@Levak oh nice, Im also glad another one is trying to contribute to this project :D and there is JUMP_BACKWARD in (list / set / dict) comprehensions also.
oh nice, Im also glad another one is trying to contribute to this project :D
Well, I took it as a dare because of what I'm trying to decompile was Python 3.9 and I was like "is there really no one else on the planet able to decompile Python 3.8+?"
and there is JUMP_BACKWARD in (list / set / dict) comprehensions also.
We need so much more tests - I just noticed that most of the compiled examples are for Python 2.x and were never recompiled for Python 3.x - So much features were added over time but not added to the test base?
@Levak
Well, I took it as a dare because of what I'm trying to decompile was Python 3.9 and I was like "is there really no one else on the planet able to decompile Python 3.8+?"
thats exactly why i started it too xD, i wanted to prove my friend that pyc on python 3.11 is not obfuscated and i can get the code, but im now checking here and there.
We need so much more tests - I just noticed that most of the compiled examples are for Python 2.x and were never recompiled for Python 3.x - So much features were added over time but not added to the test base?
yea i agree with you, there is also wrong decompilation of some IFs conditions, we need to do more tests.
continue
and break
keywords are not supported by pycdc
in recent Pythons. It seems that there was a keyword for both in the past which was removed in favor of JUMP_ABSOLUTE_A
.
So far, I have identified these cases for JUMP_ABSOLUTE_A
:
continue
AND break
(in case of nested loops)When I try to add while-loop inference from if
blocks, everything becomes a PITA. The more I find new test cases, the more I'm sinking in quicksand.
My best example so far is this one:
while i < 10: while i < 10:
i += 1 i += 1
while j < 10: <|> if j < 10:
j += 1 j += 1
if i == j: if i == j:
continue continue
Bytecode (Python 3.8) - condensed just to illustrate the relevant blocks
72 <...COMPARE_OP> 0 (i<10) 72 <...COMPARE_OP> 0 (i<10)
78 POP_JUMP_IF_FALSE 118 <|> 78 POP_JUMP_IF_FALSE 116
80 <...INPLACE_ADD> 80 <...INPLACE_ADD>
88 <...COMPARE_OP> 0 (j<10) 88 <...COMPARE_OP> 0 (j<10)
94 POP_JUMP_IF_FALSE 72 94 POP_JUMP_IF_FALSE 72
96 <...INPLACE_ADD> 96 <...INPLACE_ADD>
104 <...COMPARE_OP> 2 (i==j) 104 <...COMPARE_OP> 2 (i==j)
110 POP_JUMP_IF_FALSE 88 <|> 110 POP_JUMP_IF_FALSE 72
112 JUMP_ABSOLUTE 88 <|> 112 JUMP_ABSOLUTE 72
114 JUMP_ABSOLUTE 88 <|> 114 JUMP_ABSOLUTE 72
116 JUMP_ABSOLUTE 72 <|
Here we can see that, respectively, from left to right, the last 2 and last JUMP_ABSOLUTE
are never executed, but help highlight the while-loops. Problem is, when a POP_JUMP_IF_FALSE
points to a JUMP_ABSOLUTE
, Python decides to optimize it to a single POP_JUMP_IF_FALSE
, which ruins block tree logic. If it was obvious that a POP_JUMP_IF_FALSE
pointing to an opcode that follows a JUMP_ABSOLUTE
is an obvious while-loop, a POP_JUMP_IF_FALSE
that goes backward can either be a if or a while loop, inside a while loop.
I'm really starting to consider giving up Y_Y
EDIT: Discussions regarding while/continue linked with #467
I realize now why it is so hard to infer while loops in Python 3.8 and 3.9. The jumps are optimized by the Python compiler.
For instance, POP_JUMP_IF_FALSE X
where at position X there is a JUMP_ABSOLUTE Y
will result in a POP_JUMP_IF_FALSE Y
. This leads to dead code, as the optimized JUMP_ABSOLUTE Y
will never be reached.
Here is an example:
Initial code => equivalent code => disassembled bytecode => some comments
The X
between bytecode position and opcode means "dead code".
0 LOAD_CONST 1: 0
2 STORE_FAST 0: i
4 LOAD_CONST 1: 0
6 STORE_FAST 1: j
8 LOAD_FAST 0: i
10 LOAD_CONST 2: 10
12 COMPARE_OP 0 (<)
14 POP_JUMP_IF_FALSE 64
16 LOAD_FAST 0: i
18 LOAD_CONST 3: 1
20 INPLACE_ADD
22 STORE_FAST 0: i
24 LOAD_FAST 1: j
26 LOAD_CONST 4: 20
28 COMPARE_OP 0 (<)
30 POP_JUMP_IF_FALSE 64 => 60, exit loop 2
32 LOAD_FAST 1: j
34 LOAD_CONST 3: 1
36 INPLACE_ADD
38 STORE_FAST 1: j
40 LOAD_FAST 0: i
while i < 10: while i < 10: 42 LOAD_FAST 1: j
i += 1 i += 1 44 BINARY_ADD
while j < 20: while j < 20: 46 LOAD_CONST 5: 30
j += 1 j += 1 48 COMPARE_OP 4 (>)
if i + j > 30: if i + j > 30: 50 POP_JUMP_IF_FALSE 24 => 56 = else
break break (A) 52 JUMP_ABSOLUTE 64 => 60 break loop 2
else: else: 54 X JUMP_ABSOLUTE 24 => 56 = 58 = exit else
continue continue (B) 56 X JUMP_ABSOLUTE 24 => 58 = continue inside loop 2
continue (C) 58 X JUMP_ABSOLUTE 24 => loop 2 end
break break (D) 60 X JUMP_ABSOLUTE 64 => break loop 1
continue (E) 62 X JUMP_ABSOLUTE 8 => loop 1 end
64 LOAD_FAST 0: i
66 LOAD_FAST 1: j
68 BINARY_ADD
70 RETURN_VALUE
As you can see, every single keyword can be found back. This is only true up to Python 3.9 included. In Python 3.10 they refined this control flow which makes this observation null.
So the first step at inferring loops from ifs, is to probably resolve the dead links.
pycdc.exe ursServer.pyc Unsupported opcode: BEGIN_FINALLY (97) Unsupported opcode: BEGIN_FINALLY (97) Unsupported opcode: BEGIN_FINALLY (97) Unsupported opcode: BEGIN_FINALLY (97) Unsupported opcode: BEGIN_FINALLY (97) Warning: Stack history is not empty! Warning: block stack is not empty! Error decompyling ursServer.pyc: std::bad_cast
Tasks
BEGIN_FINALLY
END_ASYNC_FOR
CALL_FINALLY
POP_FINALLY