zhuohan123 / terapipe


FP16 Mixed Precision #12

Closed sguo35 closed 4 years ago

sguo35 commented 4 years ago

Adds manually implemented mixed precision for NCCL models. Some larger models may show gradient error as high as 1e-4 due to inaccuracy in reductions (most reductions are not yet performed in FP32). Speedup is up to 3-4x on large models that can fully saturate the GPU. The NCCL code follows the method outlined at https://on-demand.gputechconf.com/gtc-taiwan/2018/pdf/5-1_Internal%20Speaker_Michael%20Carilli_PDF%20For%20Sharing.pdf
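In case it helps review, here is a minimal sketch of the manual recipe described in that deck (FP16 model copy for compute, FP32 master weights for the update, static loss scaling). The helper names and the `loss_scale` value are illustrative, not the actual code in this PR:

```python
import torch

def make_fp16_model_and_master_params(model):
    # The model itself runs in FP16; keep an FP32 "master" copy of every parameter.
    model = model.half()
    master_params = [p.detach().clone().float().requires_grad_(True)
                     for p in model.parameters()]
    return model, master_params

def step(model, master_params, optimizer, loss, loss_scale=1024.0):
    # Scale the loss so small FP16 gradients don't underflow.
    (loss * loss_scale).backward()
    # Copy FP16 grads into the FP32 master params, undoing the scale.
    for p, mp in zip(model.parameters(), master_params):
        if p.grad is not None:
            mp.grad = p.grad.detach().float() / loss_scale
    optimizer.step()        # the optimizer is constructed over master_params
    optimizer.zero_grad()
    # Copy the updated FP32 master weights back into the FP16 model.
    with torch.no_grad():
        for p, mp in zip(model.parameters(), master_params):
            p.copy_(mp.half())
    model.zero_grad()
```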

I only tested on a p3.8xlarge, so hopefully it also works multi-node.

zhuohan123 commented 4 years ago

@sguo35 Is there any reason why we don't use amp.initialize for the model with NCCL?

sguo35 commented 4 years ago

> @sguo35 Is there any reason why we don't use amp.initialize for the model with NCCL?

Amp has some issues and doesn't really have strong support for the kind of manual partial backward passes we're doing. I had trouble getting it to work, but maybe I'm missing something.
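For context, the "manual partial backward pass" pattern in pipeline parallelism looks roughly like the sketch below (function and variable names are illustrative, not the ones in this repo): each stage calls torch.autograd.backward on its own outputs with a gradient tensor received from the next stage, whereas apex.amp's amp.scale_loss context manager is built around wrapping a single scalar loss.backward().

```python
import torch

def stage_backward(stage_outputs, grad_from_next_stage, stage_inputs):
    # Backprop through only this pipeline stage: supply the incoming gradient
    # explicitly instead of starting from a scalar loss with loss.backward().
    torch.autograd.backward(stage_outputs, grad_tensors=grad_from_next_stage)
    # Gradient w.r.t. this stage's input activations, to be sent to the
    # previous stage (assumes stage_inputs is a leaf with requires_grad=True).
    return stage_inputs.grad
```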