Issues with ppci-cc backend tree generation

windelbouwman / ppci

A compiler for ARM, X86, MSP430, xtensa and more implemented in pure Python

https://ppci.readthedocs.io/en/latest/

BSD 2-Clause "Simplified" License

335 stars 36 forks source link

Issues with ppci-cc backend tree generation #45

Open tstreiff opened 4 years ago

tstreiff commented 4 years ago

Hello, I am trying to figure out the effort for writing a backend for a simple CPU. For this I compile simple functions and display the corresponding back-end tree. For a the C function char strcpy(char d, char s) { char save = d; while (d++ = s++); return save; } This produces the following trees (to be matched by the BURG patterns) : Generation tree : MOVU64vreg5 Generation tree : MOVU64vreg6 Generation tree : MOVU64vreg0phi_s_alloc_0 Generation tree : MOVU64vreg1phi_d_alloc_0 Generation tree : JMP[strcpy_block1:] Generation tree : MOVU64vreg0phi_s_alloc_0 Generation tree : MOVI8vreg9load_4 Generation tree : MOVU64vreg1phi_d_alloc_0 Generation tree : STRI8(REGU64[vreg1phi_d_alloc_0], REGI8[vreg9load_4]) Generation tree : MOVU64[vreg7](ADDU64(REGU64[vreg0phi_s_alloc_0], CONSTU64[1])) Generation tree : MOVU64[vreg8](ADDU64(REGU64[vreg1phi_d_alloc_0], CONSTU64[1])) Generation tree : MOVU64vreg0phi_s_alloc_0 Generation tree : MOVU64vreg1phi_d_alloc_0 Generation tree : CJMPI64[('==', strcpy_block3:, strcpy_block1:)](I8TOI64(REGI8[vreg9load_4]), CONSTI64[0]) Generation tree : MOVU64vreg4retval Generation tree : JMP[strcpy_epilog:]

This raises some issues (assuming I have correctly understood how all this works) :

1) Two trees are a copy of a vreg to itself (tree 6 and 8), the backend can check for such cases that but it is better if thies MOVE are not generated

2) This makes a lot of copies... Even if after code selection, register allocation will try to reuse the real registers, each MOVE will generate instructions whatever the final real registers are (this is reflected in the corresponding x86-64 code)

3) There is no way in the backend to know when a vreg is no longer needed (last use), this means that the corresponding real register will not be deallocated til the end of the function. Precise def-use information are not necessary but the number of times a vreg is used would be nice.

4) When playing with the -O option, I have seen that in -O 0, the compiler assumes that function arguments come in registers and first saves them on the stack. This is OK especially for debugging. But if the target architecture transfers the arguments thru the stack there will be a 2nd copy of the arguments. The code generator should use the results of the "determine_arg_locations" and if all args are already on the stack, should not generate another memory copy.

5) The CJMP instruction has 2 labels (yes/no). This is perfect at IR level. However when generating code, this often produces useless jumps to following instructions : jmp label3 label3: ... The x86_64 backend does this all the time. The problem is that the backend does not know what label will come just after the CJMP (blocks are not always generated from 0 to n, for example if we reverse loops). A peephole optimizer could remove this, but it would be better not to produce this at code generation.

These are several points on a single issue report but as there is no much documentation I could be wrong on the interpretation.

Anyway I am impressed by the job done! that's Python productivity at work...

pfalcon commented 4 years ago

The x86_64 backend does this all the time.

All other backends too. Actually, I got an impression that x86_64 code has fewer of them than less developed archs. I wondered if x86_64 exactly may have some kind of peepholer. I agree this should be done in arch-independent way, even before codegeneration.

windelbouwman commented 4 years ago

Thanks for the thorough review!

Two trees are a copy of a vreg to itself (tree 6 and 8), the backend can check for such cases that but it is better if thies MOVE are not generated

Partly agreed, this instruction selection scheme works well with the implemented register allocator which removes moves from the code after instruction selection.

This makes a lot of copies... Even if after code selection, register allocation will try to reuse the real registers, each MOVE will generate instructions whatever the final real registers are (this is reflected in the corresponding x86-64 code)

No, the moves will be removed physically from the code by the current register allocator. Did you check the resulting code with ppci_explorer.py? Btw, it might make sense to make this tree output available in ppci_explorer.py as well.

There is no way in the backend to know when a vreg is no longer needed (last use), this means that the corresponding real register will not be deallocated til the end of the function. Precise def-use information are not necessary but the number of times a vreg is used would be nice.

Actually, there is way for this, there is an instruction called RegisterUseDef which serves exactly the purpose of using / defining variables so that their lifetimes are limited, otherwise the variables would be used until the end of the function.

When playing with the -O option, I have seen that in -O 0, the compiler assumes that function arguments come in registers and first saves them on the stack. This is OK especially for debugging. But if the target architecture transfers the arguments thru the stack there will be a 2nd copy of the arguments. The code generator should use the results of the "determine_arg_locations" and if all args are already on the stack, should not generate another memory copy.

Yes, fully agreed.

The CJMP instruction has 2 labels (yes/no). This is perfect at IR level. However when generating code, this often produces useless jumps to following instructions : jmp label3 label3: ... The x86_64 backend does this all the time. The problem is that the backend does not know what label will come just after the CJMP (blocks are not always generated from 0 to n, for example if we reverse loops). A peephole optimizer could remove this, but it would be better not to produce this at code generation.

There actually is some limited form of peephole optimization done to remove jumps to labels right after the jump. Agreed that it would be better to not generate the jump in the first place, but until now I was unable to implement this as such.

tstreiff commented 4 years ago

I was unaware that the RegisterAllocator could suppress instructions. I thought it just transformed vritual registers to real registers and could just add spill-out and spill-in code. I quickly checked how all this works and how "move" target instructions are recognized (instructions category is in fact given in the instructions description). I have seen the special instruction RegisterUseDef but have not found yet how it is used.

When targeting an accumulator CPU with very few registers, each vreg ends up in a stack temporary since we do not know when generating the MOVEs if the vreg is still alive.

About CJMP and the useless branch instructions An easy solution (for the backend) would be to provide the next block label in the CJMP node. Another (more complex) solution would be to make the backend generate the code label themselves. If a template existed for this (a new Template node or the existing LABEL node with goal "stm"), the backend could delay the emission of the branch instructions until it knows which label comes next. If it is neither yes nor nolabel, it generates a conditional and a unconditional branch instruction. Otherwise it générâtes only one conditional branch, reversing it if required.

pfalcon commented 4 years ago

Another (more complex) solution would be to make the backend generate the code label themselves

IMHO, matters like this should be dealt with in middle-end, on IR level, without burdening (arch-specific) backend with that. Middle-end should perform basic block scheduling (aka trace scheduling). Fine point is that it should be done in a way to allow flexibility, i.e. using different discipline/criteria for such scheduling (e.g., likely/unlikely annotations). And perhaps, there should be simply a pass before codegeneration which lower there "CJMP" if/then/else instructions into if/then as used by most CPUs.

windelbouwman commented 4 years ago

When targeting an accumulator CPU with very few registers, each vreg ends up in a stack temporary since we do not know when generating the MOVEs if the vreg is still alive.

Attention for this, I faced this issue as well when implementing the mcs6500 architecture. This CPU is basically a stack machine, and the algorithm for instruction selection/register allocation currently in use might not be the best choice for stack machines. I would like to better be able to support stack machines.

Which CPU are you targetting?

windelbouwman commented 4 years ago

And perhaps, there should be simply a pass before codegeneration which lower there "CJMP" if/then/else instructions into if/then as used by most CPUs.

This makes sense, I should probably generate appropriate selection trees. Point to make here that there are basically two IR types, the IR as specified in ppci.ir and the IR for instruction selection, with the trees like MOVU8(REGU8, CONSTU8(100)).

pfalcon commented 4 years ago

Point to make here that there are basically two IR types, the IR as specified in ppci.ir and the IR for instruction selection

Good to know. The point I'm going to promote that as much as possible various transformations should be done on the level of the main IR (ppci.ir), in machine-independent way. That may be problematic, because some decisions are actually machine-dependent. Let me provide example. Currently, PPCI generates overlong instructions for simple things like "mov rax, 0". There would be 2 ways to deal with that:

Do it on the level of backend, with instruction selection pattern like MOV(REG, 0) => "xor REG, REG".
Or, have such a pass on the level of main IR, well, perhaps on the main IR, but on the level of the lowered form of main IR, "tmp1 = 0" => "tmp1 = tmp ^ tmp1".

Way 1 means repeating exercise again and again for each backend, so it's wrong IMHO. Them way 2 is right. But what if there's a backend which doesn't support "xor reg, reg" or where it's only more expensive? Well, such a pass should be made configurable, driven by declarative info provided by particular backend (with ability to override by a user).

tstreiff commented 4 years ago

I am currently trying to write a backend for a small CPU/MPU (Motorola/Freescale 68xx). I have an old (but strong) backend for 6809 acting on a C AST. It is part of a K&R C compiler I wrote years ago and that I translated in Python last year.

More remarks on this process:

1) Obtaining a working backend is rather easy but I think it is hard to get something efficient: for this king of small CPUs, the backend has not access to sufficient information to make efficient choices.

2) Even with the help of existing backends and the Pyhton source comments, the backend "tree interface" is hard to guess. I tried to write down what I discovered and it can be the skeleton of a document (use of the "stm" goal, what data are available for each node, what are the parameters passed to the pattern functions depending on the tree nature, PTR types not used in the patterns, etc.)

3) The backend need to process many useless subtrees :

Null conversions : I16TOI16, U16TOU16 : it is easy to do (generate nothing) but it is a burden for each backend writer. Conversion of integer constants is most often useless (eg. I16TOI8(CONSTI16(0))
Trivial simplifications: multiplications and divisions by 1, constants + constantes, constants * constants, minus (constant), etc.
moves with the same vreg as source and destination (already mentioned above) but even if they could be removed at register allocation, it would be best to never generate them

4) The CALL node is a special case. Il is not clear why the generation code is in arch.py (and Arch class) instead of being a pattern. Several problems here:

The IR to trees algorithm does not generate a tree containing the argument values and the CALL node. Instead it generates a tree for each argument, storing each in a vreg, and then call the "Arch.gen_call" function. This function is in charge of dealing with the calling convention by either (depending on the target ABI) copying the vregs in real regs or pushing them on the stack. For an architecture with parameter passing thru the stack, there is no way to recognise that a tree evaluates a parameter. we have to generate code to evaluate each argument value and store it somewhere (in a stack temporary). Later on, when "gen_call" is called, each temporary has to be loaded and pushed on the stack. Generating good code is therefore hard.
I haven't understood how the "rv" (return value) parameter of "gen_call" works. Even if the function call return value is not used, "rv" contains a vreg and gen_call has to generate a useless move from the function return real register to the vreg.

5) I think that there should be code generation "hooks" in the architecture class so that the back end authors can choose to use their own Python code.

6) I haven't found an easy way to make the "codegen" display the trees to be matched for a function. This is very useful for a backend writer, and this notation is rather easier to read than the IR itself.

windelbouwman commented 4 years ago

This is truly good news, I already started the m68k target, but the m68xx appears to be more familiar to the 6502 and STM8 processors. These are accumulator machines, and I'm not too happy about the ppci support for those types of machines.

I would propose to add more arch hooks as proposed before.

Also, the documentation about the instruction selector should be improved. Having some start on this would be great.

I haven't found an easy way to make the "codegen" display the trees to be matched for a function. This is very useful for a backend writer, and this notation is rather easier to read than the IR itself.

Did you manage to generate a HTML report? The trees are listed, as are the instruction before register allocation. You can use the HtmlReportGenerator and pass this as a reporter keyword argument to many functions. For example: https://github.com/windelbouwman/ppci-mirror/blob/master/examples/riscvpicorv32/mkfwc.py#L23

* Null conversions : I16TOI16, U16TOU16 : it is easy to do (generate nothing) but it is a burden for each backend writer. Conversion of integer constants is most often useless (eg. I16TOI8(CONSTI16(0))

This probably could be fixed rather easily during generation of the selection trees. I could have a look at this. Not generating these useless nodes will probably benefit at many areas, since it will generate less cruft.

windelbouwman commented 4 years ago

Regarding the null conversions, I added some code to prevent the obvious conversions. This already saves quite some code, the specific code for checking whether a conversion is applicable lives here: https://github.com/windelbouwman/ppci-mirror/blob/master/ppci/codegen/irdag.py#L463

Conversion like I32TOI32 should be removed. Also, conversions like I32TOU32 should not be generated when register classes are equal (when I32 and U32 use the same type of register).

How about narrowing and widening conversions? I think both need some code generated in the backend right?

For example: I32TOI16 might need another register class, or it might perform a sign extend in a 32 bits register.

Vice-versa, I16TOI32 needs a sign extension or a move to a wider register.

tstreiff commented 4 years ago

Good! I think that only the null conversions need to be removed. Widening usually requires code to zero or sign-extend. Narrowing is most often simpler, but we should keep it for generality (just in case a target requires it one day) This is the usual trade-off between generality and simplicity.