Closed echatzikyriakidis closed 1 year ago
Hello @echatzikyriakidis, yes, the training process leverages HuggingFace's Trainer, which supports multiple GPUs natively. I have yet to try training this across a cluster of machines.
As for the sampling process, no. This is one of the main bottlenecks of autoregressive models. There is a sequential dependency in the generative process, so it's impossible to parallelize this with the existing model.
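To illustrate why this step cannot be parallelized, here is a minimal sketch of an autoregressive decoding loop (the `next_token` callable is a toy stand-in for a trained model, not the library's actual API):

```python
def sample_autoregressive(next_token, prompt, max_new_tokens):
    """Generate tokens one at a time. Each step consumes the full prefix
    produced so far, so the steps cannot run in parallel."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tok = next_token(tokens)  # depends on every previously generated token
        tokens.append(tok)
    return tokens

# Toy "model": the next token is simply the last token plus one.
generated = sample_autoregressive(lambda ts: ts[-1] + 1, [0], 5)
print(generated)  # [0, 1, 2, 3, 4, 5]
```

The sequential dependency is the `for` loop itself: iteration `t` cannot start until iteration `t - 1` has appended its token.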
Hi @avsolatorio!
Thank you again for this.
It is very helpful that distributed training can be used in multi-GPU environments. I will try it for sure. Also, if you have any feedback on this sooner please let me know!
Regarding sampling, I understand that the model is autoregressive: to generate a single example, every token is fed back to the input to generate the next one, and a decoding strategy is used.
By parallelization for sampling, I meant whether it is possible to generate examples using multiple GPU cards, e.g., by distributing the generation of one batch of 64 examples to GPU1, another batch of 64 to GPU2, and so on.
Maybe I could use model1.sample(device="cuda:0") and model2.sample(device="cuda:1") in two parallel threads? The only problem could be that, when I merge the synthetic examples at the end, duplicates could exist.
@echatzikyriakidis, it is possible to perform "parallel" inference using independent GPUs, provided that you set different random seeds prior to sampling. The generation is stochastic, so as long as the random states differ, you can expect independent samples. Even when duplicates exist, they will be due to random chance and should not be systematic.
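The pattern above (one worker per GPU, each with its own seed, then merge and deduplicate) can be sketched as follows. The `sample_on_device` helper is hypothetical: a plain random generator stands in for the model's actual `sample` call, and the `"cuda:0"`/`"cuda:1"` device strings are assumptions about a two-GPU setup.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def sample_on_device(seed, n_samples, device):
    """Hypothetical per-GPU worker: seed an independent random state,
    then sample. model.sample(n_samples, device=device) would go here;
    a random draw stands in so the sketch is self-contained."""
    rng = random.Random(seed)  # a different seed per worker, as suggested above
    return [rng.randint(0, 10**9) for _ in range(n_samples)]

# One job per GPU, each with a distinct seed.
jobs = [(0, 64, "cuda:0"), (1, 64, "cuda:1")]
with ThreadPoolExecutor(max_workers=len(jobs)) as ex:
    batches = list(ex.map(lambda args: sample_on_device(*args), jobs))

# Merge the per-GPU batches and drop any incidental duplicates.
merged = list({x for batch in batches for x in batch})
```

With real GPU-bound sampling, separate processes (or separate scripts per device) would avoid Python's GIL entirely; threads suffice here only because the stand-in work is trivial.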
Thank you for the clarification!
Regarding utilizing multiple GPU cards in a single VM for training, I can verify that it works, since HF's Trainer is used. I have tested it on a GCP Compute Engine VM with 2 NVIDIA T4 cards.
Hi @avsolatorio,
I would like to know if it is possible to distribute the training in an environment with multiple GPUs, or even across multiple machines?
Also, is it possible to parallelize the sampling operation with the library?
Currently, I have a single environment with one GPU and I run the training on Google Colab.
Thanks.