For a simple model:
OX=10, OY=10, IC=1, OC=3, FX=3, FY=3
Suppose the hardware architecture has 9 PEs, each with capacity == 3 (storage for one weight, one input, and one output), and three memory levels, where the L1 SRAM buffer has a capacity of 20 elements.
Then the tool generates the following schedule:
OX 1:10
OY 1:10
------------
------------
FY 1:3 # parallel
FX 1:3 # parallel
The access counts in L1 are calculated by the tool as follows:
But reloading the weights into L0 doesn't seem reasonable, right? Since they are already in L0, we don't need to fetch them from SRAM again. In this case, the weight access count in SRAM should be only 9.
The 10x10 multiplier is only justified when weight-relevant dimensions (IC, OC, FX, FY) are looped in L1; then it is a must.
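To make the expected behaviour concrete, here is a minimal sketch of the counting rule I have in mind. It is not the tool's actual code: the function names, the `(dim, size)` loop representation, and the assumption that the schedule lists loops from outermost (top) to innermost (bottom) are all mine, for illustration only.

```python
from math import prod

# Dimensions that index the weight tensor (taken from the question above).
WEIGHT_RELEVANT = {"IC", "OC", "FX", "FY"}


def naive_weight_reads(l0_tile_size, l1_loops):
    """Count L1 weight reads if the L0 tile is refilled on every L1 iteration.

    l1_loops is a hypothetical representation: a list of (dim, size) pairs,
    ordered from outermost to innermost temporal loop at the L1 level.
    """
    return l0_tile_size * prod(size for _, size in l1_loops)


def stationary_weight_reads(l0_tile_size, l1_loops):
    """Count L1 weight reads if weights stay in L0 across irrelevant loops.

    The L0 tile only has to be refilled when a weight-relevant loop advances,
    so only the loops from the outermost one down to the innermost
    weight-relevant one multiply the tile size.
    """
    innermost_relevant = -1
    for index, (dim, _) in enumerate(l1_loops):
        if dim in WEIGHT_RELEVANT:
            innermost_relevant = index
    # With no weight-relevant loop the slice is empty and prod() returns 1.
    refills = prod(size for _, size in l1_loops[: innermost_relevant + 1])
    return l0_tile_size * refills


# Schedule from the question: only OX and OY at L1, FX*FY = 9 weights in L0.
schedule = [("OX", 10), ("OY", 10)]
print(naive_weight_reads(9, schedule))       # 900
print(stationary_weight_reads(9, schedule))  # 9
```

With only OX and OY at L1, the second function gives 9, which is the count I would expect. If a weight-relevant loop such as OC were placed innermost at L1, both functions would agree, because the weights would really change on every iteration.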