ayushtues opened this issue 8 months ago
Yes, same here. It seems there is a bottleneck, but using accelerate seems to help a little. Are you using accelerate? Try setting num_processes.
Yes @lucasgris, I am using accelerate and have played around with num_workers. Even in the graph you shared, the utilization hits very low points (<25% GPU util) consistently. Any luck with improving that?
Not yet, but I think it is worth trying to identify where the code is slow. If I have any updates I will share them here.
Confirming the problem of low GPU utilization.
It seems that some sort of computation on a single CPU core is the bottleneck.
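For anyone who wants to confirm the same pattern on their machine, here is a small monitoring sketch (assuming `nvidia-ml-py` and `psutil` are installed) that can be run in a separate terminal while training; it is not part of the repo, just a quick check.

```python
# Minimal GPU/CPU utilization monitor; assumes `pip install nvidia-ml-py psutil`.
import psutil
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

try:
    while True:
        gpu = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu      # percent
        cores = psutil.cpu_percent(interval=1.0, percpu=True)       # percent per core
        print(f"GPU {gpu:3d}% | busiest CPU core {max(cores):5.1f}% | "
              f"mean CPU {sum(cores) / len(cores):5.1f}%")
except KeyboardInterrupt:
    pynvml.nvmlShutdown()
```

If the GPU column stays low while one CPU core is pinned near 100%, that matches the single-core bottleneck described above.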
Also having this problem with `train_finetune_accelerate.py`. I haven't dug too deep, but the `accelerator.backward()` calls seemed to be taking a very long time, specifically this code block: https://github.com/yl4579/StyleTTS2/blob/5cedc71c333f8d8b8551ca59378bdcc7af4c9529/train_finetune_accelerate.py#L449-L464
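One thing worth keeping in mind: CUDA kernels launch asynchronously, so a plain timer around `accelerator.backward()` can also absorb time from forward-pass kernels that are still running. Below is a minimal, self-contained sketch of how one could time the backward call more fairly; the toy model and loss are stand-ins, not the script's actual code.

```python
# Toy sketch: time a backward() call with explicit CUDA syncs, so the measurement
# isn't absorbed by asynchronous kernels still queued from the forward pass.
import time

import torch
from accelerate import Accelerator

accelerator = Accelerator()
model = torch.nn.Linear(512, 512).to(accelerator.device)
x = torch.randn(64, 512, device=accelerator.device)

loss = model(x).pow(2).mean()   # stand-in for the generator loss in that block

if torch.cuda.is_available():
    torch.cuda.synchronize()    # make sure the forward pass has actually finished
t0 = time.perf_counter()
accelerator.backward(loss)
if torch.cuda.is_available():
    torch.cuda.synchronize()    # wait for the backward kernels to finish
print(f"backward wall time: {time.perf_counter() - t0:.3f} s")
```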
I tried the following options one by one, without success:
1. Without accelerator and with accelerator
2. Increasing num_processes from 1 to 2
3. Decreasing max_len from 600 to 290
4. Switching the decoder from hifigan to istftnet
Also seeing low GPU utilization and high single-core CPU utilization here.
It also seems like the issue goes away after the first epoch is finished: my GPU starts being utilized and the CPU load becomes more evenly distributed.
@borrero-c thanks for looking into this. I didn't observe anything changing after 1 epoch; it stays low for me. Also, the `accelerator.backward()` call might be taking time simply because it's doing the backward pass, so that might be expected.
I did a little research and ran the profiler. Pay attention to the % of time spent in each call.
If we exclude the expected suspect, `accelerator.backward()`, GPU utilization increases by about 20%, but it remains uneven.
If I additionally exclude line 182:
`ppgs, s2s_pred, s2s_attn = model.text_aligner(mels, mask, texts)`
GPU utilization becomes more uniform.
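In case anyone wants to reproduce this kind of breakdown, here is a minimal `torch.profiler` template. The model and training step below are toy stand-ins, not the script's real forward/backward, so swap in the actual step from `train_finetune_accelerate.py` to get comparable numbers.

```python
# Template for a short profiling run; a toy model/step stands in for the real one.
import torch
from torch.profiler import ProfilerActivity, profile

device = "cuda" if torch.cuda.is_available() else "cpu"
activities = [ProfilerActivity.CPU] + ([ProfilerActivity.CUDA] if device == "cuda" else [])

model = torch.nn.Sequential(
    torch.nn.Linear(80, 256), torch.nn.ReLU(), torch.nn.Linear(256, 80)
).to(device)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

with profile(activities=activities, record_shapes=True) as prof:
    for _ in range(5):                                  # profile a handful of steps
        x = torch.randn(8, 80, device=device)
        loss = model(x).pow(2).mean()
        loss.backward()
        opt.step()
        opt.zero_grad()

# Sort by CPU time to spot single-core Python overhead; with a GPU, sorting by
# "self_cuda_time_total" shows which kernels dominate device time instead.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=20))
```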
Looked into it some more: my steps are taking 20-40 seconds, and the `.backward()` call is taking roughly half of that (10-20 seconds).
When the training starts to pick up after that first epoch (and the GPU is being more consistently utilized), the steps are ~4 seconds each and the backward call takes ~2 seconds.
Also interesting to see that this code block is taking a good amount of time to complete too: https://github.com/yl4579/StyleTTS2/blob/5cedc71c333f8d8b8551ca59378bdcc7af4c9529/train_finetune_accelerate.py#L306-L312
It seems that for each step, ~25% of the time is spent in the loop above and ~50% is spent in the `.backward()` call on line 464. Not sure how or if those could be improved; this isn't really my area of expertise.
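One way to pin down that ~25% / ~50% split more precisely might be to wrap the suspect regions in `torch.profiler.record_function`, so they show up as named rows in the profiler table. A self-contained sketch follows, with dummy work standing in for the real loop and backward call; the region names are placeholders.

```python
# Toy sketch: label regions with record_function so the profiler reports them
# as named rows (here a per-sample Python loop vs. a backward call).
import torch
from torch.profiler import ProfilerActivity, profile, record_function

activities = [ProfilerActivity.CPU] + ([ProfilerActivity.CUDA] if torch.cuda.is_available() else [])

x = torch.randn(8, 512, requires_grad=True)
w = torch.randn(512, 512, requires_grad=True)

with profile(activities=activities) as prof:
    with record_function("per_sample_loop"):      # stand-in for the loop at L306-L312
        ys = [x[i] @ w for i in range(x.shape[0])]
    with record_function("backward"):             # stand-in for the backward call at L464
        torch.stack(ys).sum().backward()

print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```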
Hi, I have been trying to train a StyleTTS2 model from scratch on the LibriTTS 460 dataset, and I am currently going through the first stage via `train_first.py`.
The GPU utilisation of the training is very low, ~30%. I am using a single H100 with `batch_size = 8` and `max_len = 300` to fit it on a single GPU. Such low utilisation means that the script is not using the GPU efficiently, and there are potential bottlenecks to be addressed which could make the training faster.
Has anyone observed similar issues while training the model from scratch, or any ideas for improving the GPU utilisation? (A rough data-loader timing sketch follows below.)
cc @yl4579
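As referenced above, a rough way to check whether the data pipeline alone keeps up is to iterate the train loader built in `train_first.py` without doing any model work and time it. The helper below is hypothetical, not part of the repo.

```python
# Hypothetical helper (not from the repo): time the data pipeline on its own
# by iterating the loader without touching the model or the GPU.
import time
from typing import Iterable


def time_dataloader(loader: Iterable, n_batches: int = 50) -> None:
    start = time.perf_counter()
    for i, _batch in enumerate(loader):
        if i + 1 >= n_batches:
            break
    elapsed = time.perf_counter() - start
    print(f"{n_batches} batches in {elapsed:.1f} s "
          f"({elapsed / n_batches:.3f} s/batch, no GPU work)")

# Usage inside train_first.py, right after the train loader is built:
# time_dataloader(train_dataloader)
```

If the per-batch time here is close to the full training step time, the bottleneck is in data loading and preprocessing rather than in the model itself.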