xmos / ai_tools

AI applications and tools

Add support for splitting tensor arena into persistent/non-persistent arenas #908

Open andresovela opened 1 month ago

andresovela commented 1 month ago

When a model's tensor arena requirements exceed the available SRAM, the entire tensor arena has to be placed in external RAM, which leaves performance on the table because SRAM cannot be used as scratch memory at all.

I created https://github.com/tensorflow/tflite-micro/issues/2627 to ask TFLM to support this use case, but it doesn't seem likely to happen any time soon. However, according to a collaborator on that issue, TFLM already makes it possible to split the tensor arena into persistent and non-persistent arenas.

It seems that in order to support this use case, we would need to add this functionality to xformer. I can do it myself if I get some guidance.

This would let applications place the non-persistent arena in SRAM and the persistent arena in external RAM (or vice versa), allowing models to perform better on the xcore.ai platform.
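For reference, this is roughly what the split looks like on the TFLM side, as far as I can tell from that issue. Treat it as a sketch only: the allocator class names and headers are my reading of upstream tflite-micro and may not match the version vendored here, and the buffers and sizes below are placeholders.

#include <stdint.h>

#include "tensorflow/lite/micro/arena_allocator/non_persistent_arena_buffer_allocator.h"
#include "tensorflow/lite/micro/arena_allocator/persistent_arena_buffer_allocator.h"

// Placeholder buffers and sizes: the scratch (non-persistent) arena stays in
// SRAM, while the persistent arena can be placed in external RAM by the
// application (e.g. via a linker section).
__attribute__((aligned(8))) static uint8_t scratch_arena[96 * 1024];
__attribute__((aligned(8))) static uint8_t persistent_arena[16 * 1024];

// One allocator per region; a MicroAllocator built from these two (plus a
// memory planner) would then be handed to the MicroInterpreter instead of the
// usual single tensor arena.
static tflite::NonPersistentArenaBufferAllocator non_persistent_alloc(
    scratch_arena, sizeof(scratch_arena));
static tflite::PersistentArenaBufferAllocator persistent_alloc(
    persistent_arena, sizeof(persistent_arena));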

andresovela commented 1 month ago

I was able to test the persistent/non-persistent split feature and I think it can be very useful in some cases. I tested it on a model with the following arena sizes:

In this particular case, the entire arena would actually fit in SRAM, so this test is kind of pointless, but I wanted to see what kind of performance we would be able to achieve for a model where a split actually makes sense.

Entire tensor arena in SRAM

Constraint check for tile[0]:
  Memory available:       524288,   used:      14848 .  OKAY
    (Stack: 1116, Code: 11036, Data: 2696)
Constraints checks PASSED WITH CAVEATS.
Constraint check for tile[1]:
  Memory available:       524288,   used:      240436 .  OKAY
    (Stack: 4164, Code: 49020, Data: 187252)
Constraints checks PASSED WITH CAVEATS.

Cumulative times for invoke()...
53    OP_XC_ld_flash                   2659902      26.60ms
14    OP_XC_conv2d_v2                  80202        0.80ms
4     OP_UNPACK                        3227         0.03ms
4     OP_SPLIT                         9783         0.10ms
20    OP_XC_add                        12066        0.12ms
21    OP_XC_lookup                     12327        0.12ms
12    OP_XC_mul                        6970         0.07ms
3     OP_RESHAPE                       508          0.01ms

Total time for invoke() - 2784985    27.85ms

Entire tensor arena in external RAM

Constraint check for tile[0]:
  Memory available:       524288,   used:      14848 .  OKAY
    (Stack: 1116, Code: 11036, Data: 2696)
Constraints checks PASSED WITH CAVEATS.
Constraint check for tile[1]:
  Memory available:       524288,   used:      81636 .  OKAY
    (Stack: 4164, Code: 49020, Data: 28452)
  ExtMem available:    134217728,   used:     158800 .  OKAY
    (Stack: 0, Code: 0, Data: 158800)
Constraints checks PASSED WITH CAVEATS.

Cumulative times for invoke()...
53    OP_XC_ld_flash                   2662671      26.63ms
14    OP_XC_conv2d_v2                  1372456      13.72ms
4     OP_UNPACK                        3335         0.03ms
4     OP_SPLIT                         12141        0.12ms
20    OP_XC_add                        47413        0.47ms
21    OP_XC_lookup                     26059        0.26ms
12    OP_XC_mul                        11189        0.11ms
3     OP_RESHAPE                       603          0.01ms

Total time for invoke() - 4135867    41.36ms

Non-persistent arena in SRAM, persistent arena in external RAM

Constraint check for tile[0]:
  Memory available:       524288,   used:      14848 .  OKAY
    (Stack: 1116, Code: 11036, Data: 2696)
Constraints checks PASSED WITH CAVEATS.
Constraint check for tile[1]:
  Memory available:       524288,   used:      231396 .  OKAY
    (Stack: 4164, Code: 49020, Data: 178212)
  ExtMem available:    134217728,   used:       9040 .  OKAY
    (Stack: 0, Code: 0, Data: 9040)
Constraints checks PASSED WITH CAVEATS.

Cumulative times for invoke()...
53    OP_XC_ld_flash                   2661568      26.62ms
14    OP_XC_conv2d_v2                  81024        0.81ms
4     OP_UNPACK                        3228         0.03ms
4     OP_SPLIT                         9784         0.10ms
20    OP_XC_add                        14514        0.15ms
21    OP_XC_lookup                     13199        0.13ms
12    OP_XC_mul                        7021         0.07ms
3     OP_RESHAPE                       507          0.01ms

Total time for invoke() - 2790845    27.91ms

As you can see, performance is virtually the same when only the persistent arena is moved to external RAM (27.91 ms vs. 27.85 ms total invoke() time), and we shave roughly 9 kB of SRAM off tile[1] for this particular model.

I wanted to test one of our larger models, but it turns out that its persistent arena is actually not that big, so the savings are unfortunately not enough to fit the non-persistent arena in SRAM.

However, if we manage to shave just 4 kB off the tensor arena, we could place the non-persistent arena in SRAM and see pretty significant wins for this particular model using the split-arena feature.

panickal-xmos commented 1 month ago

Thank you @andresovela. One question: for this example, where you mention placing the arena in external RAM, are you using DDR, or have you simulated that case by running the model with just one thread?

andresovela commented 1 month ago

What do you mean by DDR? Is that a feature that needs to be enabled or something?

This is all I did:

#include <stdint.h>

#if defined(EXTERN_TENSOR_ARENA)

#if defined(SPLIT_PERSISTENT_TENSOR_ARENA)
/* Persistent arena placed in external RAM via the .ExtMem.bss linker section. */
__attribute__((section(".ExtMem.bss")))
uint8_t persistent_tensor_arena[LARGEST_PERSISTENT_TENSOR_ARENA_SIZE] __attribute__((aligned(8)));
#endif

uint8_t tensor_arena[LARGEST_TENSOR_ARENA_SIZE] __attribute__((aligned(8)));

#endif
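The buffers are then handed to the interpreter through TFLM's split allocators, roughly along the lines below. Same caveat as in the issue description: the split-allocator classes and this MicroAllocator::Create overload reflect my reading of upstream tflite-micro, so treat this as a sketch rather than the exact code I ran.

// Sketch only: wiring the two arenas (declared above, in the
// SPLIT_PERSISTENT_TENSOR_ARENA configuration) into TFLM. Verify class names
// and the Create overload against the vendored tflite-micro version.
#include "tensorflow/lite/micro/arena_allocator/non_persistent_arena_buffer_allocator.h"
#include "tensorflow/lite/micro/arena_allocator/persistent_arena_buffer_allocator.h"
#include "tensorflow/lite/micro/memory_planner/greedy_memory_planner.h"
#include "tensorflow/lite/micro/micro_allocator.h"
#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/micro/micro_op_resolver.h"

tflite::MicroInterpreter* MakeSplitInterpreter(const tflite::Model* model,
                                               const tflite::MicroOpResolver& resolver) {
  // Persistent allocations go to the external-RAM buffer, non-persistent
  // (scratch) allocations to the SRAM buffer declared above.
  static tflite::PersistentArenaBufferAllocator persistent_alloc(
      persistent_tensor_arena, LARGEST_PERSISTENT_TENSOR_ARENA_SIZE);
  static tflite::NonPersistentArenaBufferAllocator non_persistent_alloc(
      tensor_arena, LARGEST_TENSOR_ARENA_SIZE);
  static tflite::GreedyMemoryPlanner planner;

  // Assumed overload taking the two allocators plus a memory planner.
  static tflite::MicroInterpreter interpreter(
      model, resolver,
      tflite::MicroAllocator::Create(&persistent_alloc, &non_persistent_alloc,
                                     &planner));
  return &interpreter;  // the caller then runs AllocateTensors() as usual
}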
andresovela commented 1 month ago

I am running xcore-opt without the --xcore-thread-count option, so I am using only 1 thread AFAIK.