zihangdai / xlnet

XLNet: Generalized Autoregressive Pretraining for Language Understanding
Apache License 2.0

Distributed GPU Training #218

Open agemagician opened 4 years ago

agemagician commented 4 years ago

Hello,

Any plans to have a script for training XLNet on distributed GPUs?

Maybe with Horovod or MultiWorkerMirroredStrategy?

LifeIsStrange commented 4 years ago

Bert equivalent https://github.com/google-research/bert/pull/568

LifeIsStrange commented 4 years ago

https://github.com/NVIDIA/Megatron-LM

agemagician commented 4 years ago

@LifeIsStrange Thanks for the links.

I already know about both of them, but they only support BERT and GPT, not XLNet.

For my use-case, I am interested in XLNet. Hopefully, we will have a distributed GPU version soon.

huseinzol05 commented 4 years ago

Actually you can; just set:

--num_core_per_host=3 --train_batch_size=30
# 3 GPUs; the batch of 30 is automatically divided among the 3 GPUs

But the current implementation uses an old distribution technique, so you will find that your RAM leaks very badly.
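For context, here is a rough sketch (not the repository's actual code) of how a tower-style multi-GPU script of this kind typically splits the global batch across GPUs; model_fn here is a hypothetical stand-in for the XLNet loss function:

import tensorflow as tf

num_core_per_host = 3   # --num_core_per_host
train_batch_size = 30   # --train_batch_size (the global batch)
per_gpu_bsz = train_batch_size // num_core_per_host  # 10 examples per GPU

def build_towers(features, model_fn):
    # Build one replica ("tower") per GPU; each tower sees its own slice
    # of the global batch, and the per-tower losses are averaged.
    tower_losses = []
    for i in range(num_core_per_host):
        with tf.device("/gpu:%d" % i):
            sliced = {k: v[i * per_gpu_bsz:(i + 1) * per_gpu_bsz]
                      for k, v in features.items()}
            tower_losses.append(model_fn(sliced))
    return tf.reduce_mean(tower_losses)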

huseinzol05 commented 4 years ago

I created a multi-GPU pretraining session for XLNet using MirroredStrategy.

Instructions on how to use it and the source code are linked; just copy and paste the code after cloning this repository.

Please remove the CUDA_VISIBLE_DEVICES setting; I put it there to limit my GPU usage.

Tested on 2 Tesla V100s with 32 GB VRAM each.
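For reference, a minimal sketch of how a MirroredStrategy plugs into an Estimator-based script like this one; the model_fn and input_fn below are toy stand-ins (TF 1.x-style APIs assumed), not the gist's actual code:

import numpy as np
import tensorflow as tf

# Toy model_fn / input_fn standing in for XLNet's Estimator pieces,
# just to show where the strategy is attached.
def model_fn(features, labels, mode, params):
    logits = tf.layers.dense(features["x"], 2)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    train_op = tf.train.AdamOptimizer(1e-3).minimize(
        loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)

def train_input_fn():
    x = np.random.rand(64, 8).astype(np.float32)
    y = np.random.randint(0, 2, size=(64,)).astype(np.int32)
    return tf.data.Dataset.from_tensor_slices(({"x": x}, y)).repeat().batch(32)

# MirroredStrategy replicates the model on each local GPU and all-reduces
# the gradients; it only covers a single node.
strategy = tf.distribute.MirroredStrategy()
run_config = tf.estimator.RunConfig(train_distribute=strategy)

estimator = tf.estimator.Estimator(model_fn=model_fn, config=run_config)
estimator.train(input_fn=train_input_fn, max_steps=100)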

agemagician commented 4 years ago

@huseinzol05 This is multi-GPU training on a single node. I am asking about distributed GPU training across multiple nodes.

huseinzol05 commented 4 years ago

Actually, you just add a TF_CONFIG like this: https://lambdalabs.com/blog/tensorflow-2-0-tutorial-05-distributed-training-multi-node/
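For illustration, the TF_CONFIG from that tutorial boils down to a JSON cluster description set per process; a minimal sketch (host names and ports are placeholders, and the strategy path follows the TF 1.14/early-2.x experimental namespace):

import json
import os
import tensorflow as tf

# Each process gets the same cluster description but a different task index.
# Host names and ports below are placeholders.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "worker": ["node0.example.com:12345", "node1.example.com:12345"],
    },
    "task": {"type": "worker", "index": 0},  # use index 1 on the second node
})

# The multi-node counterpart of MirroredStrategy: it reads TF_CONFIG and
# all-reduces gradients across every GPU on every worker.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()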

agemagician commented 4 years ago

Both your code and the official code use "MirroredStrategy", which only works for single-node multi-GPU training. To make it work across multiple nodes, "MultiWorkerMirroredStrategy" should be used instead.

It is also stated in the blog post you linked here: "TF_CONFIG" works with "MultiWorkerMirroredStrategy".
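Concretely, relative to the MirroredStrategy sketch earlier in the thread, the change is which strategy object is handed to RunConfig (class path per the TF 1.14/early-2.x experimental namespace):

import tensorflow as tf

# Single node, multi-GPU:
strategy = tf.distribute.MirroredStrategy()

# Multiple nodes (each process reads its role from the TF_CONFIG variable):
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

run_config = tf.estimator.RunConfig(train_distribute=strategy)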

huseinzol05 commented 4 years ago

I believe you can change it after you copy-paste it? lol

agemagician commented 4 years ago

Thanks for the information, but I am looking for more advanced, large-scale distributed training, using Horovod for example.
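For reference, a minimal sketch of the usual Horovod pattern for a TF 1.x Estimator script. This is the generic Horovod recipe with a toy model_fn, not an XLNet-specific integration; wiring it into train_gpu.py would still require modifying that script:

import numpy as np
import tensorflow as tf
import horovod.tensorflow as hvd

# Initialize Horovod; launch one process per GPU, e.g.
# horovodrun -np 8 -H node0:4,node1:4 python this_script.py
hvd.init()

# Pin each process to a single GPU, selected by its local rank.
session_config = tf.ConfigProto()
session_config.gpu_options.visible_device_list = str(hvd.local_rank())

def model_fn(features, labels, mode, params):
    # Toy model standing in for the XLNet model_fn.
    logits = tf.layers.dense(features["x"], 2)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    # DistributedOptimizer averages gradients across all workers via all-reduce.
    opt = hvd.DistributedOptimizer(tf.train.AdamOptimizer(1e-3 * hvd.size()))
    train_op = opt.minimize(loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)

def train_input_fn():
    x = np.random.rand(64, 8).astype(np.float32)
    y = np.random.randint(0, 2, size=(64,)).astype(np.int32)
    return tf.data.Dataset.from_tensor_slices(({"x": x}, y)).repeat().batch(32)

estimator = tf.estimator.Estimator(
    model_fn=model_fn,
    config=tf.estimator.RunConfig(session_config=session_config))

# Broadcast the initial variables from rank 0 so every worker starts identically.
estimator.train(input_fn=train_input_fn,
                hooks=[hvd.BroadcastGlobalVariablesHook(0)],
                max_steps=100)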