bug(cost model): Weight reuse counts calculation

Consider a simple layer: OX=10, OY=10, IC=1, OC=3, FX=3, FY=3. Suppose the hw architecture has 9 PEs, each with capacity==3 (one weight, one input, and one output storage), and a three-level memory hierarchy in which the L1 SRAM buffer has a capacity of 20 elements.

Then the tool generates the following schedule:

OX 1:10
    OY 1:10
------------
------------
FY 1:3 # parallel
    FX 1:3 # parallel

The tool calculates the access counts in L1 as follows:

12*12 (input) + 10*10 (output) + 10*10*9 (weight) = 1144
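
For reference, the total can be reproduced with a few lines of Python. This is only a sketch of the arithmetic, not the tool's code: the variable names are made up here, and the 12*12 input figure is assumed to come from a stride-1, unpadded convolution, i.e. (OX+FX-1)*(OY+FY-1).

```python
# Layer dimensions from the example (names are illustrative, not the tool's).
OX, OY, FX, FY = 10, 10, 3, 3

# Assumption: stride 1, no padding, so the input tile is (OX+FX-1) x (OY+FY-1).
input_reads = (OX + FX - 1) * (OY + FY - 1)   # 12*12
output_accesses = OX * OY                     # 10*10
# The 3x3 weight tile is counted once per iteration of the outer OX/OY loops.
weight_reads = OX * OY * FX * FY              # 10*10*9

total = input_reads + output_accesses + weight_reads
print(input_reads, output_accesses, weight_reads, total)  # 144 100 900 1144
```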

But reloading the weights into L0 does not seem reasonable, right? Since they are already in L0, we don't need to reload them from SRAM again. In this case, the weight access count in SRAM should be only 9.

The 10x10 multiplier is only necessary when there are weight-relevant dimensions (IC, OC, FX, FY) in L1.
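
A minimal sketch of the rule I have in mind, in Python. The helper `l1_weight_reads` and its arguments are hypothetical and do not correspond to anything in the tool; it only encodes the two cases described in this issue.

```python
from math import prod

# Dimensions that index into the weight tensor, as listed in this issue.
WEIGHT_DIMS = {"IC", "OC", "FX", "FY"}

def l1_weight_reads(weights_per_pass, outer_loops, l1_loops):
    """Hypothetical helper: weight reads from L1 under the proposed rule.

    weights_per_pass -- weight elements read from L1 during one pass over
                        the L1 and L0 loops
    outer_loops      -- (dim, trip_count) pairs for loops above L1
    l1_loops         -- (dim, trip_count) pairs for loops at L1
    """
    if any(dim in WEIGHT_DIMS for dim, _ in l1_loops):
        # The weights in L0 change within each outer iteration,
        # so they have to be fetched again every time.
        return weights_per_pass * prod(n for _, n in outer_loops)
    # The weights stay resident in L0 across the outer loops: fetch them once.
    return weights_per_pass

outer = [("OX", 10), ("OY", 10)]
print(l1_weight_reads(9, outer, []))            # schedule from this issue -> 9
print(l1_weight_reads(27, outer, [("OC", 3)]))  # made-up case, OC loop at L1 -> 2700
```

With the schedule from this issue the L1 total would then be 144 + 100 + 9 = 253 instead of 1144.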

xuanyoya / Interstellar-CNN-scheduler
