andresovela opened 4 months ago
I was able to test the persistent/non-persistent split feature, and I think it can be very useful in some cases. I tested it with a model with the following arena sizes:
In this particular case the entire arena would actually fit in SRAM, so this test is kind of pointless, but I wanted to see what kind of performance we would be able to achieve for a model where a split would actually make sense.
Entire tensor arena in SRAM
```
Constraint check for tile[0]:
Memory available: 524288, used: 14848 . OKAY
(Stack: 1116, Code: 11036, Data: 2696)
Constraints checks PASSED WITH CAVEATS.
Constraint check for tile[1]:
Memory available: 524288, used: 240436 . OKAY
(Stack: 4164, Code: 49020, Data: 187252)
Constraints checks PASSED WITH CAVEATS.

Cumulative times for invoke()...
53 OP_XC_ld_flash   2659902  26.60ms
14 OP_XC_conv2d_v2    80202   0.80ms
4  OP_UNPACK           3227   0.03ms
4  OP_SPLIT            9783   0.10ms
20 OP_XC_add          12066   0.12ms
21 OP_XC_lookup       12327   0.12ms
12 OP_XC_mul           6970   0.07ms
3  OP_RESHAPE           508   0.01ms

Total time for invoke() - 2784985 27.85ms
```
Entire tensor arena in external RAM
```
Constraint check for tile[0]:
Memory available: 524288, used: 14848 . OKAY
(Stack: 1116, Code: 11036, Data: 2696)
Constraints checks PASSED WITH CAVEATS.
Constraint check for tile[1]:
Memory available: 524288, used: 81636 . OKAY
(Stack: 4164, Code: 49020, Data: 28452)
ExtMem available: 134217728, used: 158800 . OKAY
(Stack: 0, Code: 0, Data: 158800)
Constraints checks PASSED WITH CAVEATS.

Cumulative times for invoke()...
53 OP_XC_ld_flash   2662671  26.63ms
14 OP_XC_conv2d_v2  1372456  13.72ms
4  OP_UNPACK           3335   0.03ms
4  OP_SPLIT           12141   0.12ms
20 OP_XC_add          47413   0.47ms
21 OP_XC_lookup       26059   0.26ms
12 OP_XC_mul          11189   0.11ms
3  OP_RESHAPE           603   0.01ms

Total time for invoke() - 4135867 41.36ms
```
Non-persistent arena in SRAM, persistent arena in external RAM
```
Constraint check for tile[0]:
Memory available: 524288, used: 14848 . OKAY
(Stack: 1116, Code: 11036, Data: 2696)
Constraints checks PASSED WITH CAVEATS.
Constraint check for tile[1]:
Memory available: 524288, used: 231396 . OKAY
(Stack: 4164, Code: 49020, Data: 178212)
ExtMem available: 134217728, used: 9040 . OKAY
(Stack: 0, Code: 0, Data: 9040)
Constraints checks PASSED WITH CAVEATS.

Cumulative times for invoke()...
53 OP_XC_ld_flash   2661568  26.62ms
14 OP_XC_conv2d_v2    81024   0.81ms
4  OP_UNPACK           3228   0.03ms
4  OP_SPLIT            9784   0.10ms
20 OP_XC_add          14514   0.15ms
21 OP_XC_lookup       13199   0.13ms
12 OP_XC_mul           7021   0.07ms
3  OP_RESHAPE           507   0.01ms

Total time for invoke() - 2790845 27.91ms
```
As you can see, performance is virtually the same when the persistent arena is moved to external RAM, and we are able to shave 9 kB of SRAM off this particular model.
I wanted to test one of our larger models, but it turns out that the persistent arena is actually not that big for that particular model, so the savings are unfortunately not enough to fit the non-persistent arena in SRAM.
However, if we managed to shave just 4 kB off the tensor arena, we could place the non-persistent arena in SRAM, and we would see pretty significant wins for that model using the split-arena feature.
Thank you @andresovela. One question: for this example, where you mention the arena in external RAM, are you using DDR, or did you simulate that case by running the model with just one thread?
What do you mean by DDR? Is that a feature that needs to be enabled or something?
This is all I did:
```c
#if defined(EXTERN_TENSOR_ARENA)
#if defined(SPLIT_PERSISTENT_TENSOR_ARENA)
__attribute__((section(".ExtMem.bss")))
uint8_t persistent_tensor_arena[LARGEST_PERSISTENT_TENSOR_ARENA_SIZE] __attribute__((aligned(8)));
#endif
uint8_t tensor_arena[LARGEST_TENSOR_ARENA_SIZE] __attribute__((aligned(8)));
#endif
```
I am running `xcore-opt` without the `--xcore-thread-count` option, so I am using only 1 thread AFAIK.
When the tensor arena requirements of a given model are larger than the available SRAM, the whole tensor arena has to be placed in external RAM, leaving performance on the table because SRAM cannot be used as scratch memory at all.
I created https://github.com/tensorflow/tflite-micro/issues/2627 to ask TFLM to support this use case, but it doesn't seem to be happening any time soon. However, according to a collaborator, it's already possible to split the tensor arena into persistent/non-persistent arenas.
It seems that in order to support this use case, we would need to add this functionality to `xformer`. This would allow applications to place the non-persistent arena in SRAM and the persistent arena in external RAM, or vice versa, which would let models perform better on the xcore.ai platform. I can do it myself if I get some guidance.