This blog post was originally written for the Baidu Research technical blog, and is reproduced here with their permission. Since then, these ideas have evolved and been incorporated into the excellent Horovod library by Uber, which is the easiest way to use MPI or NCCL for multi-GPU or multi-node deep learning applications.
Bringing HPC Techniques to Deep Learning - Andrew Gibiansky http://andrew.gibiansky.com/blog/machine-learning/baidu-allreduce/
Ring-allreduce, illustrated
What exactly is the difference between ring allreduce and tree allreduce? - Zhihu https://www.zhihu.com/question/57799212/answer/292494636
An introduction to ring-allreduce - Brassica_'s blog - CSDN https://blog.csdn.net/dpppBR/article/details/80445569
A drawback of naive multi-GPU training: every iteration, one GPU must gather the gradients computed on all the other GPUs and then distribute the updated model parameters back to them. The biggest problem is that GPU 0's communication time grows linearly with the number of GPUs.
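The linear growth is easy to see with a toy cost model (a sketch; the helper name and buffer size are illustrative, not from the original post):

```python
# Toy model of the naive scheme's communication cost. With N GPUs and a
# gradient buffer of D bytes, GPU 0 receives one buffer from each of the
# other N-1 GPUs, then sends updated parameters back to each of them,
# so the traffic through GPU 0 is about 2 * (N - 1) * D.

def reducer_traffic(num_gpus, grad_bytes):
    """Bytes moved through the single reducer GPU per iteration."""
    return 2 * (num_gpus - 1) * grad_bytes

D = 100 * 2**20  # assume a 100 MiB gradient buffer for illustration
for n in (2, 4, 8):
    print(n, reducer_traffic(n, D) // 2**20, "MiB")  # 200, 600, 1400 MiB
```

Doubling the number of GPUs roughly doubles the reducer's traffic, so the reducer's link becomes the bottleneck.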
Hence ring-allreduce. The idea of the algorithm: eliminate the central reducer and let the data flow around a ring formed by the GPUs. The whole ring-allreduce process has two steps: the first is scatter-reduce, the second is allgather.
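For contrast with the naive scheme, here is a sketch of ring-allreduce's per-GPU traffic (illustrative helper, not from the original post): each GPU sends one chunk of size D/N per step, over 2(N-1) total steps (N-1 scatter-reduce plus N-1 allgather), so its traffic is 2(N-1)D/N, which stays below 2D no matter how many GPUs join the ring.

```python
# Per-GPU bytes sent by ring-allreduce: 2 * (N - 1) * D / N.
# Unlike the naive scheme, this does not grow without bound as N grows.

def ring_traffic_per_gpu(num_gpus, grad_bytes):
    """Bytes each GPU sends per ring-allreduce of a grad_bytes buffer."""
    return 2 * (num_gpus - 1) * grad_bytes / num_gpus

for n in (2, 4, 8, 64):
    print(n, ring_traffic_per_gpu(n, 100.0))  # 100.0, 150.0, 175.0, 196.875
```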
Take an example with 3 GPUs. First, step one, scatter-reduce:
Step two, allgather:
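Since the original figures are not reproduced here, the two phases can be sketched as a single-process NumPy simulation (a sketch of the communication pattern only, not real multi-GPU code; the function name and chunk indexing are my own):

```python
import numpy as np

def ring_allreduce(grads):
    """Simulate ring-allreduce over a list of per-GPU gradient arrays.

    Returns the final per-GPU arrays, all equal to the elementwise sum.
    """
    n = len(grads)
    # Each GPU splits its gradient into n chunks.
    data = [np.array_split(g.astype(float), n) for g in grads]

    # Scatter-reduce: n-1 steps. In step s, GPU i sends chunk (i - s) mod n
    # to its right neighbor, which adds it into its own copy of that chunk.
    for s in range(n - 1):
        for i in range(n):
            c = (i - s) % n
            data[(i + 1) % n][c] = data[(i + 1) % n][c] + data[i][c]

    # Now GPU i holds the fully reduced chunk (i + 1) mod n.
    # Allgather: n-1 more steps. In step s, GPU i sends chunk (i + 1 - s)
    # mod n to its right neighbor, which simply overwrites its copy.
    for s in range(n - 1):
        for i in range(n):
            c = (i + 1 - s) % n
            data[(i + 1) % n][c] = data[i][c]

    return [np.concatenate(chunks) for chunks in data]

# The 3-GPU example: after the 2 scatter-reduce and 2 allgather steps,
# every GPU ends up with the full sum [111, 222, 333].
grads = [np.array([1.0, 2.0, 3.0]),
         np.array([10.0, 20.0, 30.0]),
         np.array([100.0, 200.0, 300.0])]
print(ring_allreduce(grads))
```

Each chunk is sent and reduced exactly once per hop, which is why every GPU's link carries the same, balanced load.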