xmos / ai_tools

AI applications and tools

Add example for loading weights from external RAM #903

Closed andresovela closed 1 month ago

andresovela commented 1 month ago

The existing models that showcase loading weights from flash are very useful, but I'd like to see an example that loads the weights from flash and then transfers them to LPDDR1 memory for faster I/O.

I took a look at the LPDDR1 docs, but I'm not sure exactly how you would tie the current xf.generate_flash() + xflash approach together with the __attribute__ ((section(".ExtMem.data"))) approach from LPDDR1.

Presumably you could do

xxd -i xcore_flash_binary.out > model_weights.cc

and annotate the generated array with __attribute__ ((section(".ExtMem.data")))?

Would that work?
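
For concreteness, this is roughly what I imagine the annotated file would look like (untested sketch; the array name just follows xxd's naming convention for xcore_flash_binary.out):

// model_weights.cc (sketch, not verified)
// Array generated by `xxd -i xcore_flash_binary.out`, hand-annotated so the
// linker places it in external (LPDDR) memory.
__attribute__ ((section(".ExtMem.data")))
unsigned char xcore_flash_binary_out[] = {
    0x00, 0x01, 0x02  /* ... the actual bytes emitted by xxd ... */
};
// xxd emits a literal length here; sizeof keeps the sketch consistent.
unsigned int xcore_flash_binary_out_len = sizeof(xcore_flash_binary_out);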

More than 10% of the execution time is spent running OP_XC_ld_flash when running the example model using the --xcore-conv-err-threshold=0.5 option.

Cumulative times for invoke()...
1     OP_XC_pad_3_to_4                 26992        0.27ms
14    OP_XC_pad                        53368        0.53ms
30    OP_XC_ld_flash                   459310       4.59ms
29    OP_XC_conv2d_v2                  2779619      27.80ms
1     OP_XC_slice                      194          0.00ms
1     OP_RESHAPE                       132          0.00ms
1     OP_XC_softmax                    392          0.00ms

Total time for invoke() - 3320007    33.20ms

This is significantly worse on some of the models I'm trying to run, e.g.

Cumulative times for invoke()...
18    OP_XC_slice                      21092        0.21ms
15    OP_RESHAPE                       2307         0.02ms
15    OP_XC_pad_v2                     5125         0.05ms
17    OP_XC_ld_flash                   81979        0.82ms
16    OP_XC_conv2d_v2                  21638        0.22ms
3     OP_CONCATENATION                 9196         0.09ms
1     OP_CONV_2D                       171          0.00ms
0     OP_UNPACK                        0            0.00ms
0     OP_SPLIT                         0            0.00ms
0     OP_XC_lookup                     0            0.00ms
0     OP_MUL                           0            0.00ms
0     OP_XC_binaryi16                  0            0.00ms
0     OP_XC_unaryi16                   0            0.00ms
0     OP_PACK                          0            0.00ms
0     OP_SUM                           0            0.00ms

Total time for invoke() - 141508     1.42ms

Therefore I'd like to see how performance is affected by moving the weights to RAM.

andresovela commented 1 month ago

I did some digging, and it doesn't look like this scenario is currently supported (I might be wrong), although it is described in the current documentation:

This system has very few constraints. It can execute large models that are held in external memory, and the tensor arena can either be held in external or internal memory.

It looks like load ops are automatically replaced with loadflash ops when the xcore-weights-file option is used. https://github.com/xmos/ai_tools/blob/3c0a431656dffd27cb99497b2c7ffb320006cbb3/xformer/Transforms/WriteWeights.cpp#L141-L142

andresovela commented 1 month ago

We tried running a model with a much larger number of parameters, and we are definitely seeing a significant I/O bottleneck.

Cumulative times for invoke()...
53    OP_XC_ld_flash                   2659902      26.60ms
14    OP_XC_conv2d_v2                  80202        0.80ms
4     OP_UNPACK                        3227         0.03ms
4     OP_SPLIT                         9783         0.10ms
20    OP_XC_add                        12066        0.12ms
21    OP_XC_lookup                     12327        0.12ms
12    OP_XC_mul                        6970         0.07ms
3     OP_RESHAPE                       508          0.01ms

Any help with this will be appreciated.

panickal-xmos commented 1 month ago

Hi @andresovela, I need to check this in more detail. Are the weights small enough to be kept in SRAM itself (not using flash or DDR)? That would be the fastest option.

andresovela commented 1 month ago

Hi @panickal-xmos, the models we're testing right now have around 1.4 MB of weight data, so SRAM is not an option for us at the moment. We're trying out other architectures that require fewer parameters but bigger tensors in order to get around the I/O bound. However, it'd be nice to be able to just move the weights to external RAM.

andresovela commented 1 month ago

I found this option in xcore-opt:

https://github.com/xmos/ai_tools/blob/175ca9811df5916a35c52d6bf8ebeb87428fbc48/xformer/XCoreOptMain.cpp#L75-L77

It's not currently documented, but it looks like we could potentially use this to load the weights from external RAM?

panickal-xmos commented 1 month ago

Your point is very valid. It would ideally be useful to use DDR to quickly fit a larger model. We had deprioritized DDR support, as there is currently a hardware limitation that prevents us from using multiple threads when directly accessing DDR. In the generated model .cpp file, you can see this:

// When using USE_DDR_FIX for enabling LPDDR support, only one thread can be used
#ifdef USE_DDR_FIX
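// (the 5 below is presumably the thread count this model was compiled with)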
static_assert((5 == 1),
             "Only one thread can be used when using USE_DDR_FIX! Please recompile with one thread!");
#endif

Because of this limitation, the model runs much slower and the benefit of using DDR instead of flash goes down.

The undocumented xcore-load-tile option is for using the other tile to hold weights, if that tile is free. This could potentially be used to load weights from DDR into SRAM. Let me look into this some more.

andresovela commented 1 month ago

I did see this limitation written down somewhere. Unfortunately I have no choice but to use LPDDR.

A bit off-topic, but now that you mention multiple threads: the documentation states that a maximum of 5 threads can be used. Why is this the case? Doesn't xcore.ai have 8 cores per tile? Why the 5-thread limitation? I have seen this elsewhere as well, for example:

https://github.com/xmos/xcore_iot/blob/74daaefb43c8a2714f265c3bc8799e6da88d98c5/examples/freertos/iot/src/FreeRTOSConfig.h#L19-L24

Thank you for looking into the issue for me btw, I appreciate it :)

andresovela commented 1 month ago

Btw, I had not tested the --xcore-thread-count option yet, but I just did a quick test comparing inference time with 1 thread vs. 5 threads. While some ops do take less time with 5 threads, OP_XC_ld_flash takes pretty much exactly the same time in both runs.

Considering that OP_XC_ld_flash overwhelmingly dominates the execution time in my case, sacrificing the ability to run the model on multiple threads in exchange for faster weight loading from LPDDR isn't even a trade-off.

Cumulative times for invoke()...
53    OP_XC_ld_flash                   2659902      26.60ms
14    OP_XC_conv2d_v2                  80202        0.80ms
4     OP_UNPACK                        3227         0.03ms
4     OP_SPLIT                         9783         0.10ms
20    OP_XC_add                        12066        0.12ms
21    OP_XC_lookup                     12327        0.12ms
12    OP_XC_mul                        6970         0.07ms
3     OP_RESHAPE                       508          0.01ms

panickal-xmos commented 1 month ago

Regarding,

A bit off-topic, but now that you mention multiple threads: the documentation states that a maximum of 5 threads can be used. Why is this the case? Doesn't xcore.ai have 8 cores per tile? Why the 5-thread limitation?

xcore.ai can support up to eight threads. However, five threads are capable of using all the compute available on xcore.ai: eight threads using all the compute will only be as fast as five threads using all the compute, with each of the eight threads running slightly slower in that case. Usually with AI models, we use five threads for the AI compute and other threads for support, such as a thread for the flash_server, which does the loading of data from flash.

Considering that OP_XC_ld_flash overwhelmingly dominates the execution time in my case, sacrificing the ability to run the model on multiple threads in exchange for faster weight loading from LPDDR isn't even a trade-off.

It's good that you checked the timings. The model seems very memory-bound. We will investigate DDR support and report back on whether we can do something about it.

panickal-xmos commented 1 month ago

@andresovela, how does the performance look with the weights in DDR, based on this example: https://github.com/xmos/ai_tools/pull/914?

andresovela commented 1 month ago

I'll try it out in a bit and I'll report back :)

andresovela commented 1 month ago

@panickal-xmos I made the modifications according to the DDR example and I see a 60% reduction in the execution time of OP_XC_ld_flash. It's a bit strange that this op is still being used even though there's no flash involved; maybe the name is a misnomer as well.

Cumulative times for invoke()...
53    OP_XC_ld_flash                   1064422      10.64ms
14    OP_XC_conv2d_v2                  81348        0.81ms
4     OP_UNPACK                        2812         0.03ms
4     OP_SPLIT                         11476        0.11ms
20    OP_XC_add                        9637         0.10ms
21    OP_XC_lookup                     7056         0.07ms
12    OP_XC_mul                        5520         0.06ms
3     OP_RESHAPE                       507          0.01ms

Total time for invoke() - 1182778    11.83ms

andresovela commented 1 month ago

I somehow expected the performance gain to be much larger than 2.5x, considering how large the bandwidth difference is (DDR 6,400 Mbit/s vs. flash 200 Mbit/s, roughly 32x, according to Henk).

panickal-xmos commented 1 month ago

Yes, it's slower due to going through the tile ram server interface. I'm looking into another, simpler option to copy weights directly from DDR.

andresovela commented 1 month ago

Let me know if there's anything I can do to help :)

andresovela commented 1 month ago

Hi @panickal-xmos, I wanted to do some tests with both weights and tensor arena in DDR, but since the current example uses tile[0] for reading weights and tile[1] for running the model, I get this error:

xmap: Error: Elf file ".elfibGIX8" requires external memory (DDR), and a previously linked elf file requires it too. Only one tile on a node may access DDR
xmake[1]: *** [bin//app_hello_world.xe] Error 1
xmake: *** [bin//app_hello_world.xe] Error 2

So while the DDR example in #914 is very informative, we won't be able to use it with this limitation.

panickal-xmos commented 1 month ago

Hi @andresovela, I have merged the updated example in https://github.com/xmos/ai_tools/pull/914. Along with compiler and runtime changes, DDR should be a lot faster. I have changed the DDR frequency to 166 MHz for the example. Copying from DDR is slower for smaller weights, hence we use xcore-load-externally-if-larger to specify a minimum size for weights to be loaded externally; see https://github.com/xmos/ai_tools/pull/914/files#diff-58032a712fa7813c63c7b4690eb53efbe85cd9e04a7c293fc5c025c3965038c6R29. You will have to tweak this for your model. Let me know how it goes.

andresovela commented 1 month ago

Hi @panickal-xmos, thanks for the example! I'll try it out in a bit. Can you explain what xcore-load-externally-if-larger does? From what I understand of the example, all weights are placed in external RAM regardless of that number:

        # write weights as an array to be placed in DDR
        "xcore-write-weights-as-array" : "True",
        "xcore-weights-in-external-memory" : "True",
        # For DDR, we want to ideally reduce loads smaller
        # than 4000 bytes, as they are slower.
        # But this would increase memory usage on tile
        # and so it is a tradeoff
        "xcore-load-externally-if-larger" : "1500",

Are the weights somehow duplicated in internal RAM for performance? I tried to find out from the source code but I couldn't figure it out.

panickal-xmos commented 1 month ago

A constant tensor (weight) is loaded externally only if it is larger than the threshold set by xcore-load-externally-if-larger; if it is smaller, it remains in internal RAM. The default value for xcore-load-externally-if-larger is 96 bytes. Copying from DDR works best for larger copies, so keeping smaller weights out of external memory speeds up execution. The weights are not duplicated.
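
In other words, the placement rule is roughly this (illustrative sketch only, not the actual xformer code):

#include <stdbool.h>
#include <stddef.h>

// Whether a constant tensor (weight) is offloaded to external memory.
// threshold_bytes corresponds to xcore-load-externally-if-larger (default 96).
bool load_weight_externally(size_t weight_size_bytes, size_t threshold_bytes) {
    // Larger weights are loaded from flash/DDR at runtime;
    // smaller ones stay in internal RAM as part of the model.
    return weight_size_bytes > threshold_bytes;
}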

andresovela commented 1 month ago

I modified my test application as shown by the DDR example and I can see a further 53% reduction in the time spent loading weights, without modifying the LPDDR clock frequency.

Cumulative times for invoke()...
67    OP_XC_ld_weights                 502638       5.03ms
14    OP_XC_conv2d_v2                  81353        0.81ms
4     OP_UNPACK                        2809         0.03ms
4     OP_SPLIT                         11474        0.11ms
20    OP_XC_add                        9632         0.10ms
21    OP_XC_lookup                     7052         0.07ms
12    OP_XC_mul                        5520         0.06ms
3     OP_RESHAPE                       507          0.01ms

Total time for invoke() - 620985     6.21ms

With the system frequency set to 800 MHz and LPDDR clock set to 166 MHz I get an extra 40% reduction.

Cumulative times for invoke()...
67    OP_XC_ld_weights                 303810       3.04ms
14    OP_XC_conv2d_v2                  61012        0.61ms
4     OP_UNPACK                        2108         0.02ms
4     OP_SPLIT                         8604         0.09ms
20    OP_XC_add                        7225         0.07ms
21    OP_XC_lookup                     5290         0.05ms
12    OP_XC_mul                        4140         0.04ms
3     OP_RESHAPE                       381          0.00ms

Total time for invoke() - 392570     3.93ms

After that, I was able to keep some weights in internal RAM using the xcore-load-externally-if-larger option. I was able to set it as high as 64000 before the weights no longer fit in internal RAM. This got me a further 15% reduction.

Cumulative times for invoke()...
14    OP_XC_conv2d_v2                  60652        0.61ms
10    OP_XC_ld_weights                 258879       2.59ms
4     OP_UNPACK                        2412         0.02ms
4     OP_SPLIT                         7320         0.07ms
20    OP_XC_add                        7221         0.07ms
21    OP_XC_lookup                     5251         0.05ms
12    OP_XC_mul                        4153         0.04ms
3     OP_RESHAPE                       378          0.00ms

Total time for invoke() - 346266     3.46ms

In total, with all the optimizations, I went from 26.60 ms spent loading weights from flash to 2.59 ms using LPDDR. That's more than a 90% reduction in time spent loading weights!

Thanks for all the help so far, I appreciate it a lot!

With this, I'm closing this issue :)

andresovela commented 1 month ago

Note that the model I'm testing is an int8 one we have for reference, not the int16x8 one we sent you via email. Unfortunately, we can't test that one yet because its tensor arena is too large.