zer0n / deepframeworks

Evaluation of Deep Learning Frameworks

Comments about Multi-GPU and Torch #9

Open sirotenko opened 8 years ago

sirotenko commented 8 years ago

Thanks for creating this comparison page. I think it will be useful for many people. A few comments:

  1. CNTK multi-GPU. The paper you mentioned only presents results for fully connected networks, which can't be compared to distributed training of CNNs, for example. Distributed training of CNNs turns out to be much harder in terms of achieving a high scaling factor; so far, no one has demonstrated predictable and fast distributed training of CNNs.
  2. Model deployment. I think Torch's mark could be 0.5 higher. Since Torch is written in C and Lua, and Lua is itself written in C and is very compact and embeddable, you can actually make Torch models run on exotic processors or DSPs (and even FPGAs) that only have a C compiler available (see the sketch after this list).
  3. Architecture. This might be subjective, but I would give Torch a lower mark. The nn module alone looks good, but Torch is more than just nn, so as a consequence of building on Lua you have to use a lot of other modules.
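
For illustration (this isn't from the thread), here is a minimal sketch of the kind of Lua-side entry point such a deployment could drive: a C host only needs the Lua C API (`luaL_newstate`, `luaL_dofile`, `lua_pcall`) to run this file and call `predict`. The file names `run_model.lua` and `model.t7` and the `predict` function are hypothetical.

```lua
-- run_model.lua: a hypothetical Lua-side entry point for a C host program.
-- The host creates a Lua state, runs this file, then calls predict()
-- through the Lua C API.
require 'torch'
require 'nn'

local model = torch.load('model.t7')  -- hypothetical path to a trained model
model:evaluate()                      -- disable training-only behaviour (dropout, etc.)

-- Expose a single function the host can call with a flat table of numbers.
function predict(values)
  local input = torch.Tensor(values)
  local output = model:forward(input)
  return output:totable()
end
```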
zer0n commented 8 years ago

Thanks @sirotenko

  1. Agreed that distributed training of CNNs is harder. However, keep in mind that (a) although there isn't much empirical evidence yet, 1-bit (or 2/4-bit) quantization of the gradient is a generic and promising technique for distributed training in general (a rough sketch of the idea follows this list); and (b) we don't have good benchmarks even for single-node training, let alone distributed training, which is why I don't provide ratings for multi-GPU performance (although I personally believe that CNTK would easily win).
  2. Model deployment. It's not about whether something can be done, but whether it can be done easily and fits well with the rest of the production pipeline. As evidence, a friend of mine had trouble deploying a trained Torch model on Android.
  3. It's true that with Torch you may need to use many modules. I personally like that, because it keeps the architecture modular and compact. For example, most NN toolkits embed SGD and other optimizers, but in Torch, SGD lives in the optim package, and the scope of optim isn't limited to neural networks or ML (see the small example below).
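
On (a), here is a rough, hypothetical sketch of 1-bit gradient quantization with error feedback, the general idea behind CNTK's 1-bit SGD. It is an illustration of the technique only, not CNTK's actual implementation.

```lua
-- A rough sketch (an assumption, not CNTK's code) of 1-bit gradient
-- quantization with error feedback: each element is sent as +s or -s,
-- and the quantization error is carried over to the next step.
require 'torch'

local residual = nil  -- accumulated quantization error

local function quantizeGradient(grad)
  if residual == nil then residual = grad:clone():zero() end
  local corrected = grad + residual            -- add back what was lost last step
  local scale = torch.abs(corrected):mean()    -- one shared magnitude per tensor
  local quantized = torch.sign(corrected):mul(scale)
  residual = corrected - quantized             -- remember the new error
  return quantized                             -- this is what a worker would transmit
end

-- Example: quantize a fake gradient and inspect what was dropped.
local g = torch.randn(5)
local q = quantizeGradient(g)
print(g, q, residual)
```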
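To make point 3 concrete, here is a minimal, hypothetical training loop showing the split described above: the model and criterion come from nn, while SGD itself lives in optim and only ever sees a closure over the flattened parameters.

```lua
-- Minimal sketch of the nn/optim split: optim.sgd knows nothing about
-- neural networks, only about a closure returning (loss, gradient).
require 'torch'
require 'nn'
require 'optim'

local model = nn.Sequential():add(nn.Linear(10, 1))
local criterion = nn.MSECriterion()
local params, gradParams = model:getParameters()

-- Toy data, just to make the example runnable.
local x = torch.randn(10)
local y = torch.Tensor{1.0}

local sgdState = {learningRate = 0.01}

for step = 1, 100 do
  local function feval(p)
    if p ~= params then params:copy(p) end
    gradParams:zero()
    local output = model:forward(x)
    local loss = criterion:forward(output, y)
    model:backward(x, criterion:backward(output, y))
    return loss, gradParams
  end
  optim.sgd(feval, params, sgdState)
end
```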