Recombine FS inputs into vectors

anarsoul commented 6 years ago

NIR splits outputs and inputs in nir_lower_io_to_scalar_early() and unfortunately for FS it means increased number of instructions and increased register pressure.

We need to recombine inputs from scalar to vectors.

See https://www.mail-archive.com/mesa-dev@lists.freedesktop.org/msg189216.html

yuq commented 6 years ago

Right, I also meet this problem after 18.0 rebase. But due to want to focus on kernel, I haven't solve it from the root. Now I'm doing 18.1 rebase, so will cover these NIR changes in a better way.

anarsoul commented 6 years ago

kmscube -M rgba fails in lima-18.0 branch due to this issue with "ppir: ppir: regalloc fail"

anarsoul commented 6 years ago

impl main {
    decl_reg vec4 32 r0
    decl_reg vec2 32 r1
    block block_0:
    /* preds: */
    vec1 32 ssa_0 = load_const (0x00000000 /* 0.000000 */)
    vec1 32 ssa_1 = intrinsic load_input (ssa_0) () (1, 0) /* base=1 */ /* component=0 */   /* vVaryingColor */
    vec1 32 ssa_2 = intrinsic load_input (ssa_0) () (1, 1) /* base=1 */ /* component=1 */   /* vVaryingColor */
    vec1 32 ssa_3 = intrinsic load_input (ssa_0) () (1, 2) /* base=1 */ /* component=2 */   /* vVaryingColor */
    vec1 32 ssa_4 = intrinsic load_input (ssa_0) () (1, 3) /* base=1 */ /* component=3 */   /* vVaryingColor */
    r0.x = imov ssa_1
    r0.y = imov ssa_2.x
    r0.z = imov ssa_3.x
    r0.w = imov ssa_4.x
    vec1 32 ssa_6 = intrinsic load_input (ssa_0) () (0, 0) /* base=0 */ /* component=0 */   /* vTexCoord */
    vec1 32 ssa_7 = intrinsic load_input (ssa_0) () (0, 1) /* base=0 */ /* component=1 */   /* vTexCoord */
    r1.x = imov ssa_6
    r1.y = imov ssa_7.x
    vec4 32 ssa_9 = tex r1 (coord), 0 (texture) 0 (sampler)
    vec4 32 ssa_10 = fmul r0, ssa_9
    intrinsic store_output (ssa_10, ssa_0) () (0, 15, 0) /* base=0 */ /* wrmask=xyzw */ /* component=0 */   /* gl_FragColor */
    /* succs: block_0 */
    block block_0:
}

========prog========
-------block------
const 0 ssa0
st_col 15 new
  mul 14 ssa10
    mov 5 reg0
      ld_var 1 ssa1
    mov 6 reg0
      ld_var 2 ssa2
    mov 7 reg0
      ld_var 3 ssa3
    mov 8 reg0
      ld_var 4 ssa4
    ld_tex 13 ssa9
      mov 11 reg1
        ld_var 9 ssa6
      mov 12 reg1
        ld_var 10 ssa7
====================
ppir: ppir_lower_texture create load_coords node 16 for 13
========prog========
-------block------
st_col 15 new
  mul 14 ssa10
    mov 5 reg0
      ld_var 1 ssa1
    mov 6 reg0
      ld_var 2 ssa2
    mov 7 reg0
      ld_var 3 ssa3
    mov 8 reg0
      ld_var 4 ssa4
    ld_tex 13 ssa9
      ld_coords 16 new
        mov 11 reg1
          ld_var 9 ssa6
        mov 12 reg1
          ld_var 10 ssa7
====================
ppir: node_to_instr create move 17 from store 15
ppir: insert_load_tex: create move 18 for 13
======ppir instr list======
      vary texl unif vmul smul vadd sadd comb stor const0|1
*000: null null null 14   null 17   null null null | 
 001: null null null null null null 5    null null | 
 002: 1    null null null null null null null null | 
 003: null null null null null null 6    null null | 
 004: 2    null null null null null null null null | 
 005: null null null null null null 7    null null | 
 006: 3    null null null null null null null null | 
 007: null null null null null null 8    null null | 
 008: 4    null null null null null null null null | 
 009: 16   13   null null null 18   null null null | 
 010: null null null null null null 11   null null | 
 011: 9    null null null null null null null null | 
 012: null null null null null null 12   null null | 
 013: 10   null null null null null null null null | 
------------------------
======ppir instr depend======
[0[1[2]][3[4]][5[6]][7[8]][9[10[11]][12[13]]]]
------------------------
ppir: ppir: regalloc fail

yuq commented 6 years ago

In fact, even we don't recombine scalar to vector, ppir should not fail, but only generate longer code. This "regalloc fail" is indeed the ppir need to implement reg spill when out of regs.

anarsoul commented 6 years ago

As far as I understand we need to reverse engineer how temporaries work first. I see store and load temporary instructions in the doc, but I don't understand where they're stored.

yuq commented 6 years ago

I guess it's here: https://github.com/yuq/mesa-lima/blob/lima-18.0/include/drm-uapi/lima_drm.h#L107

Each PP can have a memory stack which I guess is used to store tmp.

anarsoul commented 6 years ago

There're 2 stack_address, one in lima_pp_frame_reg, another in drm_lima_m400_pp_frame/drm_lima_m450_pp_frame.

I'm not sure what's the difference between them.

yuq commented 6 years ago

lima_pp_frame_reg one is dummy, drm_lima_m400_pp_frame one is used, one for each PP.

anarsoul commented 6 years ago

@yuq how do you tell which one is dummy?

yuq commented 6 years ago

In this function: https://github.com/yuq/linux-lima/blob/lima-4.17-rc4/drivers/gpu/drm/lima/lima_pp.c#L303

LIMA_PP_FRAME & LIMA_PP_STACK are per PP, so the lima_pp_frame_reg will be set to new value before task start.

anarsoul commented 6 years ago

@yuq, what about LIMA_PP_STACK_SIZE?

anarsoul commented 6 years ago

And why LIMA_PP_STACK is not used for mali450?

yuq commented 6 years ago

LIMA_PP_STACK is used by mali450 here: https://github.com/yuq/linux-lima/blob/lima-4.17-rc4/drivers/gpu/drm/lima/lima_pp.c#L321

Just because bcast will set same address to all PPs, so I have to set it individually for each PP.

LIMA_PP_STACK_SIZE is same for all PPs, so the lima_pp_frame_reg one is not dummy.

enunes commented 6 years ago

I'll have some time to work on lima again starting by the end of this week. If nobody is currently working on this, I could pick it up and work on it.

From what I understand there are two issues, 1) recombine scalars to vec4 to restore the earlier behavior and 2) implement register spilling to avoid this kind of problem in the future

Is that correct? Anybody working in any of these?

Just to be sure though, are we sure already that we want (1) (as in the title of this issue), even with (2) implemented? Or is it something that we would need to benchmark?

yuq commented 6 years ago

I think you are right. 2 is needed anyway for correctness and 1 is for better performance. But 1 needs to be done also because I wrote ppir for vec4, not for scalar, some refine or recombine may be needed for correctness.

yuq / mesa-lima

Recombine FS inputs into vectors #56