xmos / ai_tools

AI applications and tools

Add example for splitting model into two models and running them in two tiles #910

Open andresovela opened 1 month ago

andresovela commented 1 month ago

First of all, sorry for spamming you with issues :sweat_smile: I'm just trying to optimize inference performance for a model larger than 512 kB, and I'm exploring all possible options at the moment.

I think one alternative would be to split the large model into two smaller models, run each one on a separate tile, and have the output of the model on tile[0] be the input to the model on tile[1]. Is that technically feasible using channels, or is that a no-go?
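For concreteness, this is roughly the top-level wiring I have in mind (just a sketch; `run_part0`/`run_part1` are placeholder names for the per-tile inference tasks):

```
// main.xc -- sketch only, task names are placeholders
#include <platform.h>

void run_part0(chanend c_act); // runs the first sub-model, implemented in C
void run_part1(chanend c_act); // runs the second sub-model, implemented in C

int main(void) {
  chan c_act; // carries the intermediate activation tensor between the tiles
  par {
    on tile[0]: run_part0(c_act);
    on tile[1]: run_part1(c_act);
  }
  return 0;
}
```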

panickal-xmos commented 1 month ago

No worries, more information helps us improve the tools. Yes, that is feasible. We have been primarily focused on flash-based workflows, but there are various options to achieve what you are looking for. What is the size of the model? Would it fit within two tiles? We could communicate directly via email, as that might make it easier to share more information about the models. My email is deepakpanickal@xmos.com.

andresovela commented 1 month ago

Nice, I'll contact you via email then :)

panickal-xmos commented 1 month ago

This can be done by splitting the model yourself and compiling the parts separately with xmos-ai-tools. You would then have to wire them up in the application source code, roughly as in the sketch below. It's not recommended, though, as you would give up quite a bit of space on both tiles for code and a tensor arena. When splitting a model across multiple tiles, it's better to keep the code and tensor arena on one tile, and the weights on one or more other tiles.
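A minimal sketch of that per-tile glue, assuming the lib_xcore channel API; `ACT_BYTES` and the `run_partN` names are placeholders, and the interpreter setup/invoke calls are elided:

```c
// inference.c -- rough sketch of the per-tile glue, not a complete program
#include <stdint.h>
#include <xcore/channel.h>

#define ACT_BYTES 1024 // placeholder: size of the intermediate activation tensor

// Tile 0: run the first sub-model, then push its output tensor over the channel.
void run_part0(chanend_t c_act) {
  static uint8_t output[ACT_BYTES];
  // ... set inputs, invoke the first interpreter, copy its output here ...
  chan_out_buf_byte(c_act, output, ACT_BYTES);
}

// Tile 1: block on the channel, then feed the received tensor to the
// second sub-model as its input.
void run_part1(chanend_t c_act) {
  static uint8_t input[ACT_BYTES];
  chan_in_buf_byte(c_act, input, ACT_BYTES);
  // ... copy `input` into the second interpreter's input tensor and invoke ...
}
```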

andresovela commented 1 month ago

I agree it's a waste of resources, but it may allow a model to run faster than if you had to read the weights from flash. Realistically, I think we won't use this option, but it'd be good to have an example, I guess :)