That just means you don't have enough memory on your GPU to run this. Try reducing batch_size and max_len in the config.
But my batch_size is already 2 and batch_percentage is 0.5. I am sharing my config file here:
log_dir: "/hdd2/Sandipan/SDhar-Projects/StyleTTS2/Models/New_Hindi_Speech_2nd" first_stage_path: "/hdd5/Sandipan/SDhar-Projects/StyleTTS2/Log_files/epoch_1st_00037.pth" save_freq: 2
log_interval: 10 device: "cuda"
epochs_1st: 200 # number of epochs for first stage training (pre-training)
epochs_2nd: 100 # number of peochs for second stage training (joint training) batch_size: 2 max_len: 100
pretrained_model: "" second_stage_load_pretrained: true # set to true if the pre-trained model is for 2nd stage load_only_params: false # set to true if do not want to load epoch numbers and optimizer parameters
F0_path: "Utils/JDC/bst.t7" ASR_config: "Utils/ASR/config.yml" ASR_path: "Utils/ASR/epoch_00080.pth"
PLBERT_dir: 'Utils/PLBERT_all_languages/'
data_params:
train_data: "/hdd5/Sandipan/SDhar-Projects/StyleTTS2/Hindi_Data_Phoneme/train.txt"
val_data: "/hdd5/Sandipan/SDhar-Projects/StyleTTS2/Hindi_Data_Phoneme/valid.txt"
root_path: "/hdd2/Sandipan/database/Hindi_ASR_200/Hindi_Clean/"
OOD_data: "/hdd5/Sandipan/SDhar-Projects/StyleTTS2/Hindi_Data_Phoneme/odd.txt"
min_length: 50 # sample until texts with this size are obtained for OOD texts
preprocess_params: sr: 24000 spect_params: n_fft: 2048 win_length: 1200 hop_length: 300
model_params: multispeaker: true #true #false
dim_in: 64 hidden_dim: 512 max_conv_dim: 512 n_layer: 3 n_mels: 80
n_token: 178 # number of phoneme tokens max_dur: 50 # maximum duration of a single phoneme style_dim: 128 # style vector size
dropout: 0.2
######### config for decoder
############################## decoder: type: 'hifigan' # either hifigan or istftnet resblock_kernel_sizes: [3,7,11] upsample_rates : [10,5,3,2] upsample_initial_channel: 512 resblock_dilation_sizes: [[1,3,5], [1,3,5], [1,3,5]] upsample_kernel_sizes: [20,10,6,4]
slm: model: 'microsoft/wavlm-base-plus' sr: 16000 # sampling rate of SLM hidden: 768 # hidden size of SLM nlayers: 13 # number of layers of SLM initial_channel: 64 # initial channels of SLM discriminator head
diffusion: embedding_mask_proba: 0.1
transformer:
num_layers: 3
num_heads: 8
head_features: 64
multiplier: 2
# diffusion distribution config
dist:
sigma_data: 0.2 # placeholder for estimate_sigma_data set to false
estimate_sigma_data: true # estimate sigma_data from the current batch if set to true
mean: -3.0
std: 1.0
loss_params: lambda_mel: 5. # mel reconstruction loss lambda_gen: 1. # generator loss lambda_slm: 1. # slm feature matching loss
lambda_mono: 1. # monotonic alignment loss (1st stage, TMA)
lambda_s2s: 1. # sequence-to-sequence loss (1st stage, TMA)
TMA_epoch: 50 # TMA starting epoch (1st stage)
lambda_F0: 1. # F0 reconstruction loss (2nd stage)
lambda_norm: 1. # norm reconstruction loss (2nd stage)
lambda_dur: 1. # duration loss (2nd stage)
lambda_ce: 20. # duration predictor probability output CE loss (2nd stage)
lambda_sty: 1. # style reconstruction loss (2nd stage)
lambda_diff: 1. # score matching loss (2nd stage)
diff_epoch: 20 # style diffusion starting epoch (2nd stage)
joint_epoch: 50 # joint training starting epoch (2nd stage)
optimizer_params: lr: 0.0001 # general learning rate bert_lr: 0.00001 # learning rate for PLBERT ft_lr: 0.00001 # learning rate for acoustic modules
slmadv_params: min_len: 100
max_len: 200 batch_percentage: 0.5 # to prevent out of memory, only use half of the original batch size
iter: 10 # update the discriminator every this iterations of generator update thresh: 5 # gradient norm above which the gradient is scaled scale: 0.01 # gradient scaling factor for predictors from SLM discriminators sig: 1.5 # sigma for differentiable duration modeling
I assume this happens right at the beginning. The error says: `torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 7; 79.15 GiB total capacity; 2.32 GiB already allocated; 3.19 MiB free; 2.37 GiB reserved in total by PyTorch)` — so only 2.37 GiB is reserved by torch in total. Is there anything else running on your GPU?
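A quick way to cross-check what torch itself sees on that device (a minimal sketch; `7` is the device index from the error above, and it needs a reasonably recent torch for `mem_get_info`):

```python
import torch

idx = 7  # device index from the error message
free, total = torch.cuda.mem_get_info(idx)  # free/total bytes as reported by the CUDA driver
print(f"free: {free / 1024**3:.2f} GiB, total: {total / 1024**3:.2f} GiB")
print(f"allocated by this process: {torch.cuda.memory_allocated(idx) / 1024**3:.2f} GiB")
print(f"reserved by this process:  {torch.cuda.memory_reserved(idx) / 1024**3:.2f} GiB")
```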
Actually, I am running my code on our lab server. There are 8 GPUs, of which 4-5 are already in use by others' jobs. I am running my code on a specific GPU (id 7), which is not being used by anyone else as of now.
Output of the nvidia-smi command for GPU id 7, which I am using:
```
|   7  NVIDIA L40S              Off | 00000000:24:00.0 Off |                    0 |
| N/A   36C    P8     23W / 350W    |     3MiB / 46068MiB  |      0%     Default |
|                                   |                      |                 N/A |
```
It seems that there is an issue somewhere, but I can't really put my finger on it. GPU 7 seems to be a 48 GB card, yet torch says it's an 80 GB one? What command are you using to run the code? There are places in the code where it's `.to("cuda")` instead of `.to(device)`; maybe fixing those would help?
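For illustration, the distinction matters because the bare string "cuda" resolves to the current default device (usually cuda:0), not the one you intended (a minimal sketch):

```python
import torch

device = torch.device("cuda:7" if torch.cuda.is_available() else "cpu")

x = torch.randn(4, 80)
x = x.to(device)    # lands on cuda:7 as intended
# x.to("cuda")      # pitfall: "cuda" means the current default device (cuda:0 unless
#                   # changed), so the tensor can silently end up on a different, busy GPU
```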
Whenever I change the specific line of code where the error was raised, the same error appears the next time at a different line. For example:
File "train_second.py", line 827, in <module>
I am simply using this command: `python train_second.py`
This is how I am setting the device id, and then using `.to(device)` in the required parts of the code:
```python
device_id = 7
device = torch.device(device_id if torch.cuda.is_available() else "cpu")
```
In my code I have already replaced all `.to("cuda")` calls with `.to(device)`.
This seems to be an issue that is not linked to StyleTTS; I tried something similar and it seemed okay. Have you tried changing the device to just "cuda" and using `CUDA_VISIBLE_DEVICES=7`?
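For reference, `CUDA_VISIBLE_DEVICES` remaps which physical GPUs the process can see, so inside the script the chosen card becomes cuda:0 and the device can simply be "cuda". A sketch of the same idea from within Python (the variable must be set before torch initializes CUDA):

```python
import os

# Equivalent to `CUDA_VISIBLE_DEVICES=7 python train_second.py` on the command line;
# must run before any CUDA initialization.
os.environ["CUDA_VISIBLE_DEVICES"] = "7"

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # now maps to physical GPU 7
```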
No, let me try that then.
Thank you. Actually, it seems the problem was on my end, with the GPU I was specifying. I had been using `CUDA_VISIBLE_DEVICES` with different GPU ids whenever I found an idle GPU on our server, but `CUDA_VISIBLE_DEVICES=<gpu id>` was running my code on other GPUs instead of the specific one I specified. That's why I set the GPU id in code with `device_id = 7; device = torch.device(device_id if torch.cuda.is_available() else "cpu")`. However, I was still getting the CUDA out of memory error.
But this time, when I executed my code with `CUDA_VISIBLE_DEVICES=5 python train_second.py`, it started running. I understand I have to do this kind of trial and error.
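To cut down on that trial and error, free memory per GPU can be listed in one query (a rough sketch using nvidia-smi's query mode):

```python
import subprocess

# Print index, used and total memory for every GPU so an idle card can be picked directly.
out = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=index,memory.used,memory.total", "--format=csv,noheader"]
)
print(out.decode())
```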
Thanks for your suggestion.
@SandyPanda-MLDL would you mind closing this issue if it's resolved please?
Sure
While executing the stage 2 training I am continuously getting a CUDA out of memory error. I am executing the stage 2 training code on an NVIDIA L40S GPU.
File "train_second.py", line 827, in
main()
File "/hdd5/Sandipan/envs/styletts1/lib/python3.7/site-packages/click/core.py", line 1157, in call
return self.main(args, kwargs)
File "/hdd5/Sandipan/envs/styletts1/lib/python3.7/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/hdd5/Sandipan/envs/styletts1/lib/python3.7/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, ctx.params)
File "/hdd5/Sandipan/envs/styletts1/lib/python3.7/site-packages/click/core.py", line 783, in invoke
return __callback(args, kwargs)
File "train_second.py", line 428, in main
y_rec_gt_pred = model.decoder(en, F0_real, N_real, s)
File "/hdd5/Sandipan/envs/styletts1/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, *kwargs)
File "/hdd5/Sandipan/SDhar-Projects/StyleTTS2/Modules/hifigan.py", line 478, in forward
x = self.generator(x, s, F0_curve)
File "/hdd5/Sandipan/envs/styletts1/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(input, kwargs)
File "/hdd5/Sandipan/SDhar-Projects/StyleTTS2/Modules/hifigan.py", line 341, in forward
xs += self.resblocks[iself.num_kernels+j](x, s)
File "/hdd5/Sandipan/envs/styletts1/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(input, kwargs)
File "/hdd5/Sandipan/SDhar-Projects/StyleTTS2/Modules/hifigan.py", line 67, in forward
xt = n1(x, s)
File "/hdd5/Sandipan/envs/styletts1/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, *kwargs)
File "/hdd5/Sandipan/SDhar-Projects/StyleTTS2/Modules/hifigan.py", line 21, in forward
h = self.fc(s)
File "/hdd5/Sandipan/envs/styletts1/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(input, kwargs)
File "/hdd5/Sandipan/envs/styletts1/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 7; 79.15 GiB total capacity; 2.32 GiB already allocated; 3.19 MiB free; 2.37 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
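As the last line of the error suggests, when reserved memory far exceeds allocated memory the caching allocator may be fragmented; `max_split_size_mb` can be set via `PYTORCH_CUDA_ALLOC_CONF` (a minimal sketch; the value 128 is just an illustrative starting point, not a recommendation from this thread):

```python
import os

# Limit the size of splittable blocks in PyTorch's caching allocator to reduce
# fragmentation; must be set before CUDA is initialized. Equivalent to
# `PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 python train_second.py`.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # import after setting the variable
```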