Closed: meriamOu closed this issue 11 months ago.
What does it mean by restart? What’s the behavior?
Thank you so much for your prompt reply. While trying to debug with multiple GPUs (>3), I found that the code exits the training loop when executing this line: tokens_pred, words_pred = bert(phonemes, attention_mask=(~text_mask).int()).
And when I run on 1 GPU, it throws this error on the same line: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
It seems like a conflict between CUDA and the accelerator.
Looking forward to hearing back from you.
It could be a version issue; this repo is almost a year old. Could you check that the models and the data are all on the GPU?
Yes, I did. It restarts randomly in the middle of training, for both single-GPU and multi-GPU runs. Could you share which torch and accelerate versions you are using?
I'm using
accelerate 0.18.0
torch 2.0.1
transformers 4.18.0
You need to change the following code in the train() function:
original:
text_mask = length_to_mask(torch.Tensor(input_lengths))#.to(device)
new:
text_mask = length_to_mask(torch.Tensor(input_lengths)).to('cuda:0')
Thank you for replying. I installed the required dependencies and changed the code to .to('cuda:0'), but it still exits the loop after some number of iterations (steps). It does not execute this line: tokens_pred, words_pred = bert(phonemes, attention_mask=(~text_mask).int()) and exits the loop instead. This happens for both single-GPU and multi-GPU training.
I am running the training on wikipedia_20220301.es.
If it is multi-GPU training, try .to('cuda') instead of .to('cuda:0'). I had the same error, and that was the solution for me. If the error still occurs, you need to debug to see whether something is wrong with your data.
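For reference, a more portable variant of the same fix, as a sketch only: it assumes the accelerator object created in train() via Hugging Face Accelerate is in scope, and avoids hard-coding a device string for either the single-GPU or multi-GPU case.
# Sketch: let Accelerate pick the right device instead of hard-coding 'cuda:0' / 'cuda'.
# Assumes `accelerator = Accelerator(...)` from this repo's train() is in scope.
text_mask = length_to_mask(torch.Tensor(input_lengths)).to(accelerator.device)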
I have an issue like this. I started training on the tr wiki dataset and it was all going fine, with no error reported in the output (using the training notebook and the same changed line as @tekinek). The output is below:
Saving.. Step [40100/1000000], Loss: 1.49200, Vocab Loss: 0.40791, Token Loss: 1.03655 Step [40200/1000000], Loss: 1.42115, Vocab Loss: 0.43196, Token Loss: 1.34368 Step [40300/1000000], Loss: 1.48467, Vocab Loss: 0.27744, Token Loss: 1.13450 Step [40400/1000000], Loss: 1.52968, Vocab Loss: 0.45252, Token Loss: 1.17900 Step [40500/1000000], Loss: 1.52161, Vocab Loss: 0.85615, Token Loss: 1.34191 Step [40600/1000000], Loss: 1.44281, Vocab Loss: 0.38077, Token Loss: 0.85241 Step [40700/1000000], Loss: 1.49886, Vocab Loss: 0.47578, Token Loss: 1.09572 Step [40800/1000000], Loss: 1.48848, Vocab Loss: 0.56533, Token Loss: 1.04840 Step [40900/1000000], Loss: 1.46339, Vocab Loss: 0.34172, Token Loss: 1.06775 Step [41000/1000000], Loss: 1.47465, Vocab Loss: 0.42328, Token Loss: 1.05896 Step [41100/1000000], Loss: 1.48895, Vocab Loss: 0.34386, Token Loss: 0.95051 Step [41200/1000000], Loss: 1.53809, Vocab Loss: 0.57109, Token Loss: 1.43240 Step [41300/1000000], Loss: 1.51472, Vocab Loss: 0.34291, Token Loss: 1.25509 Step [41400/1000000], Loss: 1.44077, Vocab Loss: 0.30832, Token Loss: 0.90913 Step [41500/1000000], Loss: 1.49300, Vocab Loss: 0.51084, Token Loss: 1.26583 Step [41600/1000000], Loss: 1.46832, Vocab Loss: 0.23410, Token Loss: 0.73553 Step [41700/1000000], Loss: 1.38602, Vocab Loss: 0.39008, Token Loss: 1.07576 Step [41800/1000000], Loss: 1.47418, Vocab Loss: 0.40175, Token Loss: 1.50833 Step [41900/1000000], Loss: 1.45734, Vocab Loss: 0.44198, Token Loss: 1.38305 Step [42000/1000000], Loss: 1.47958, Vocab Loss: 0.39486, Token Loss: 1.32486 Step [42100/1000000], Loss: 1.45463, Vocab Loss: 0.34351, Token Loss: 0.82936 Step [42200/1000000], Loss: 1.49740, Vocab Loss: 1.11530, Token Loss: 1.20732 Step [42300/1000000], Loss: 1.48120, Vocab Loss: 0.46177, Token Loss: 1.27091 Step [42400/1000000], Loss: 1.44107, Vocab Loss: 0.32943, Token Loss: 1.01672 Step [42500/1000000], Loss: 1.48959, Vocab Loss: 0.41913, Token Loss: 1.33052 Step [42600/1000000], Loss: 1.49566, Vocab Loss: 0.47067, Token Loss: 1.29855 Step [42700/1000000], Loss: 1.48040, Vocab Loss: 0.36264, Token Loss: 1.03860 Step [42800/1000000], Loss: 1.41925, Vocab Loss: 0.26632, Token Loss: 1.08702 Step [42900/1000000], Loss: 1.43263, Vocab Loss: 0.43131, Token Loss: 1.00497 Step [43000/1000000], Loss: 1.43548, Vocab Loss: 0.45944, Token Loss: 1.23576 Step [43100/1000000], Loss: 1.44422, Vocab Loss: 0.43210, Token Loss: 1.19467 Step [43200/1000000], Loss: 1.46760, Vocab Loss: 0.58364, Token Loss: 1.43040 Step [43300/1000000], Loss: 1.46705, Vocab Loss: 0.27622, Token Loss: 0.87948 Step [43400/1000000], Loss: 1.46574, Vocab Loss: 0.26687, Token Loss: 0.97012 Step [43500/1000000], Loss: 1.41721, Vocab Loss: 0.37851, Token Loss: 0.90212 Step [43600/1000000], Loss: 1.44600, Vocab Loss: 0.55063, Token Loss: 1.23012 Step [43700/1000000], Loss: 1.44428, Vocab Loss: 0.37495, Token Loss: 1.31076 Step [43800/1000000], Loss: 1.46817, Vocab Loss: 0.99723, Token Loss: 1.23601 Step [43900/1000000], Loss: 1.43992, Vocab Loss: 0.31964, Token Loss: 0.98278 Step [44000/1000000], Loss: 1.45688, Vocab Loss: 0.59675, Token Loss: 1.01967 Step [44100/1000000], Loss: 1.42800, Vocab Loss: 0.40675, Token Loss: 0.94316 Step [44200/1000000], Loss: 1.47083, Vocab Loss: 0.31533, Token Loss: 0.91085 Step [44300/1000000], Loss: 1.47878, Vocab Loss: 0.74292, Token Loss: 1.31877 Step [44400/1000000], Loss: 1.43414, Vocab Loss: 0.41619, Token Loss: 0.87569 Step [44500/1000000], Loss: 1.40559, Vocab Loss: 0.83712, Token Loss: 1.58816 Launching training on one GPU. 
154 Checkpoint loaded. Start training... Step [40100/1000000], Loss: 10.25152, Vocab Loss: 6.56193, Token Loss: 2.97999 Step [40200/1000000], Loss: 9.78631, Vocab Loss: 6.93832, Token Loss: 2.91657 Step [40300/1000000], Loss: 9.72887, Vocab Loss: 7.05161, Token Loss: 2.95276 Step [40400/1000000], Loss: 9.76388, Vocab Loss: 6.86722, Token Loss: 2.96487 Step [40500/1000000], Loss: 9.75784, Vocab Loss: 6.66469, Token Loss: 2.96348 Step [40600/1000000], Loss: 9.71858, Vocab Loss: 6.82886, Token Loss: 2.96320 Step [40700/1000000], Loss: 9.71030, Vocab Loss: 6.70475, Token Loss: 2.95046 Step [40800/1000000], Loss: 9.72021, Vocab Loss: 6.72686, Token Loss: 3.01363
Same here, the number of steps increases because it keeps restarting. If you comment out the while True loop, the program will exit at very early steps.
notebook_launcher(train, args=(), num_processes=8)
Also, when restarted this way, the loss measurements do not decrease. What are acceptable loss values for StyleTTS2? Would it be enough to use the metrics from the first 40k steps I posted above? After it restarts, the loss metrics look like this for me:
Checkpoint loaded. Start training... Step [40100/1000000], Loss: 10.28770, Vocab Loss: 6.66913, Token Loss: 2.94239 Step [40200/1000000], Loss: 9.78529, Vocab Loss: 6.34892, Token Loss: 2.93758 Step [40300/1000000], Loss: 9.75077, Vocab Loss: 6.90626, Token Loss: 2.94373 Step [40400/1000000], Loss: 9.66055, Vocab Loss: 6.56046, Token Loss: 2.95285 Step [40500/1000000], Loss: 9.73120, Vocab Loss: 6.58032, Token Loss: 2.94348 Step [40600/1000000], Loss: 9.71287, Vocab Loss: 6.97784, Token Loss: 2.99823 Step [40700/1000000], Loss: 9.70973, Vocab Loss: 7.33855, Token Loss: 3.07795 Step [40800/1000000], Loss: 9.69358, Vocab Loss: 6.80982, Token Loss: 2.97387 Step [40900/1000000], Loss: 9.71688, Vocab Loss: 6.89250, Token Loss: 2.93217 Step [41000/1000000], Loss: 9.70198, Vocab Loss: 7.14828, Token Loss: 2.98258 Step [41100/1000000], Loss: 9.67578, Vocab Loss: 6.47327, Token Loss: 2.94004 Step [41200/1000000], Loss: 9.70120, Vocab Loss: 6.79401, Token Loss: 2.93975 Step [41300/1000000], Loss: 9.71414, Vocab Loss: 6.81877, Token Loss: 2.96429 Step [41400/1000000], Loss: 9.69623, Vocab Loss: 6.59699, Token Loss: 2.93384 Step [41500/1000000], Loss: 9.70767, Vocab Loss: 6.82683, Token Loss: 2.93702 Step [41600/1000000], Loss: 9.67128, Vocab Loss: 6.84039, Token Loss: 2.96825 Step [41700/1000000], Loss: 9.73315, Vocab Loss: 6.69190, Token Loss: 2.98339 Step [41800/1000000], Loss: 9.68523, Vocab Loss: 6.86579, Token Loss: 2.94225 Step [41900/1000000], Loss: 9.69374, Vocab Loss: 6.75254, Token Loss: 2.98534 Step [42000/1000000], Loss: 9.70822, Vocab Loss: 6.78436, Token Loss: 2.93244 Step [42100/1000000], Loss: 9.66040, Vocab Loss: 6.95662, Token Loss: 2.93810 Step [42200/1000000], Loss: 9.67323, Vocab Loss: 6.62248, Token Loss: 2.89363 Step [42300/1000000], Loss: 9.68485, Vocab Loss: 6.57238, Token Loss: 2.94248 Step [42400/1000000], Loss: 9.70034, Vocab Loss: 6.37540, Token Loss: 2.87066 Step [42500/1000000], Loss: 9.65605, Vocab Loss: 6.54033, Token Loss: 2.96908 Step [42600/1000000], Loss: 9.67558, Vocab Loss: 6.76095, Token Loss: 2.91498 Step [42700/1000000], Loss: 9.66115, Vocab Loss: 6.75113, Token Loss: 2.92413 Step [42800/1000000], Loss: 9.69055, Vocab Loss: 6.80216, Token Loss: 3.02703 Step [42900/1000000], Loss: 9.68627, Vocab Loss: 6.51554, Token Loss: 2.99655 Step [43000/1000000], Loss: 9.69567, Vocab Loss: 6.59192, Token Loss: 2.98032 Step [43100/1000000], Loss: 9.70853, Vocab Loss: 6.72628, Token Loss: 2.91694 Step [43200/1000000], Loss: 9.69278, Vocab Loss: 6.30876, Token Loss: 2.98141 Step [43300/1000000], Loss: 9.67357, Vocab Loss: 6.42720, Token Loss: 2.90176 Step [43400/1000000], Loss: 9.66609, Vocab Loss: 6.83585, Token Loss: 2.99419 Step [43500/1000000], Loss: 9.67175, Vocab Loss: 6.99638, Token Loss: 3.02473 Step [43600/1000000], Loss: 9.69820, Vocab Loss: 7.02330, Token Loss: 2.96267 Step [43700/1000000], Loss: 9.70423, Vocab Loss: 6.60086, Token Loss: 2.92851 Step [43800/1000000], Loss: 9.68251, Vocab Loss: 6.79012, Token Loss: 2.94723 Step [43900/1000000], Loss: 9.67484, Vocab Loss: 6.79818, Token Loss: 2.94563 Step [44000/1000000], Loss: 9.74457, Vocab Loss: 6.82389, Token Loss: 2.94492 Step [44100/1000000], Loss: 9.72267, Vocab Loss: 6.59023, Token Loss: 2.94163 Step [44200/1000000], Loss: 9.67673, Vocab Loss: 6.66906, Token Loss: 3.03378 Step [44300/1000000], Loss: 9.65737, Vocab Loss: 6.47070, Token Loss: 3.01240 Step [44400/1000000], Loss: 9.67047, Vocab Loss: 7.17890, Token Loss: 3.05258 Step [44500/1000000], Loss: 9.68795, Vocab Loss: 6.26416, Token Loss: 2.92195 Step 
[44600/1000000], Loss: 9.70950, Vocab Loss: 6.58410, Token Loss: 2.96636 Step [44700/1000000], Loss: 9.68582, Vocab Loss: 6.81770, Token Loss: 2.94325 Step [44800/1000000], Loss: 9.65639, Vocab Loss: 6.62923, Token Loss: 2.92160 Step [44900/1000000], Loss: 9.74960, Vocab Loss: 6.76100, Token Loss: 3.02768 Step [45000/1000000], Loss: 9.70884, Vocab Loss: 6.63771, Token Loss: 2.92381 Saving.. Step [45100/1000000], Loss: 9.70385, Vocab Loss: 6.76958, Token Loss: 2.93478 Step [45200/1000000], Loss: 9.68422, Vocab Loss: 6.45621, Token Loss: 2.89889 Step [45300/1000000], Loss: 9.61993, Vocab Loss: 6.66716, Token Loss: 2.94280 Step [45400/1000000], Loss: 9.72888, Vocab Loss: 6.73698, Token Loss: 3.02977 Step [45500/1000000], Loss: 9.70757, Vocab Loss: 6.73496, Token Loss: 3.03904 Step [45600/1000000], Loss: 9.67096, Vocab Loss: 6.55992, Token Loss: 2.92315 Step [45700/1000000], Loss: 9.70785, Vocab Loss: 6.48424, Token Loss: 2.92371 Step [45800/1000000], Loss: 9.67936, Vocab Loss: 6.54059, Token Loss: 2.96156 Step [45900/1000000], Loss: 9.70811, Vocab Loss: 6.58546, Token Loss: 2.98122
I suspect some mismatch between the tokenizer and BERT, although no error is shown that helps to debug.
Is it necessary to change the config file in order to train PL-BERT on Latin-script languages like Spanish and French? Or do you have any idea how to debug this?
Furthermore, I let the training keep running on 1 GPU even though it restarts and reloads the latest checkpoint, and in the end it threw this new error:
Step [5640/1000000], Loss: 9.13946, Vocab Loss: 6.39332, Token Loss: 3.10756
Traceback (most recent call last):
  File "train.py", line 204, in <module>
    notebook_launcher(train, args=(), num_processes=1, use_port=33389)
  File "/usr/local/lib/python3.7/site-packages/accelerate/launchers.py", line 156, in notebook_launcher
    function(*args)
  File "train.py", line 141, in train
    for _, batch in enumerate(train_loader):
  File "/usr/local/lib/python3.7/site-packages/accelerate/data_loader.py", line 387, in __iter__
    next_batch = next(dataloader_iter)
  File "/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 561, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/usr/local/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/meri/PL-BERT-main/dataloader.py", line 107, in __getitem__
    words = [self.token_maps[w]['token'] for w in words]
  File "/home/meri/PL-BERT-main/dataloader.py", line 107, in <listcomp>
    words = [self.token_maps[w]['token'] for w in words]
KeyError: 181065
Looking forward to hearing back
Also, when restarted this way, loss measurements do not decrease. What are acceptable loss values for StyleTTS2? Would it be enough to use the metrics from the first 40k steps I posted above? After it restarts, the loss metrics go as posted above.
No clue, but the loss is way too big. The losses on my end for the English checkpoints were close to 1 for both the vocab and token loss.
KeyError: 181065
You will have to make a new token map. Did you use the default token_map.pkl?
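As a quick way to confirm this is the problem, here is a minimal sketch that checks whether the pickled token map covers every id in the processed dataset. The file paths and the column name are assumptions, not the repo's exact names; the dict structure is inferred from the lookup token_maps[w]['token'] in dataloader.py.
# Sketch: check token map coverage against the processed dataset (paths/column are assumed).
import pickle
from datasets import load_from_disk

with open("token_maps.pkl", "rb") as f:           # or whatever path your config points to
    token_maps = pickle.load(f)
dataset = load_from_disk("wikipedia_20220301.es.processed")   # assumed path to the processed data

missing = set()
for example in dataset:                            # slow on the full dataset; sample if needed
    for w in example["input_ids"]:                 # assumed column holding the word/token ids
        if w not in token_maps:
            missing.add(w)
print(len(missing), "ids missing from the token map")
Any non-zero count reproduces the KeyError above, and means the token map should be regenerated from the same dataset/tokenizer pair before training.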
@tosunozgun after the restart, the losses went quite wrong in your case. I suspect something changed in the dataset or in the token/symbol mapping. It seems you have changed the list of symbols (IPA + letters + punctuation, ...) in text_utils.py. The default size is 178, but yours is 154.
@tekinek yeah, I changed it for my language, and probably that is not enough and other parts of the code need changing too. The message below appears in the Jupyter log when it restarts:
_kernel_2d: block: [0,0,0], thread: [28,0,0] Assertion `t >= 0 && t < n_classes` failed.
../aten/src/ATen/native/cuda/Loss.cu:250: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [26,0,0] Assertion `t >= 0 && t < n_classes` failed.
../aten/src/ATen/native/cuda/Loss.cu:250: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [27,0,0] Assertion `t >= 0 && t < n_classes` failed.
../aten/src/ATen/native/cuda/Loss.cu:250: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [28,0,0] Assertion `t >= 0 && t < n_classes` failed.
../aten/src/ATen/native/cuda/Loss.cu:250: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [29,0,0] Assertion `t >= 0 && t < n_classes` failed.
../aten/src/ATen/native/cuda/Loss.cu:250: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [30,0,0] Assertion `t >= 0 && t < n_classes` failed.
../aten/src/ATen/native/cuda/Loss.cu:250: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [31,0,0] Assertion `t >= 0 && t < n_classes` failed.
Sorry, I have no idea about this error; it looks too deep... If your target language is tr, I recommend you don't change anything except the language option 'tr' for the phonemizer, and regenerate token_maps.pkl accordingly (save it as token_maps_tr.pkl and change config.yml accordingly to make sure you are using the same mapping everywhere). Both the default tokenizer and phonemizer should work for tr.
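For what it's worth, the CUDA assertion above (t >= 0 && t < n_classes) typically fires when a target id passed to the cross-entropy loss is outside the classifier's range, which fits the shrunken 154-symbol list mentioned earlier. A minimal CPU-side check, as a sketch with placeholder names rather than the repo's exact variables:
# Sketch: find target ids that are out of range for the classifier head (run on CPU,
# since the CUDA assert hides the offending value). `labels` / `n_symbols` are placeholders.
import torch

def check_label_range(labels: torch.Tensor, n_symbols: int) -> torch.Tensor:
    flat = labels.view(-1)
    return flat[(flat < 0) | (flat >= n_symbols)].unique()

# e.g. print(check_label_range(batch_labels, 154)) inside the training loop; any output
# means the data still contains ids from the old 178-symbol inventory.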
@yl4579
Sorry, is there a way to evaluate the model, or to see whether training is going right?
Step [169800/1000000], Loss: 3.12783, Vocab Loss: 2.76724, Token Loss: 2.02258 Step [169900/1000000], Loss: 3.01506, Vocab Loss: 1.62503, Token Loss: 1.63829 Step [170000/1000000], Loss: 3.04034, Vocab Loss: 0.89939, Token Loss: 1.60095 Saving.. Step [170100/1000000], Loss: 2.91625, Vocab Loss: 1.38221, Token Loss: 1.85416 Step [170200/1000000], Loss: 3.09688, Vocab Loss: 1.84194, Token Loss: 1.35628 Step [170300/1000000], Loss: 3.11275, Vocab Loss: 1.78001, Token Loss: 1.71689
In my training, the loss dropped from ~10 to ~3, but I am really not sure if the model is learning something useful. My dataset has 20M raw sentences, the vocab size is 230k, and the batch size is 16, training on two 4090 cards.
@tekinek decreasing losses are usually a good sign. The only way to verify whether it works for a downstream task is to fine-tune it on that task, so just let it train longer until it converges and then fine-tune it for TTS tasks.
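Short of full fine-tuning, a rough spot check is to run the forward pass quoted in this thread on a held-out batch and eyeball the predictions. This is a sketch only: it reuses the bert/phonemes/text_mask names from earlier comments and assumes the two outputs are logits whose last dimension is the vocabulary.
# Sketch: eyeball a trained checkpoint's predictions on one batch (no gradients).
import torch

bert.eval()
with torch.no_grad():
    tokens_pred, words_pred = bert(phonemes, attention_mask=(~text_mask).int())
# take the argmax over the last (vocabulary) dimension and compare a few positions
# against the corresponding ground-truth labels by eye
print(tokens_pred[0].argmax(dim=-1)[:20])
print(words_pred[0].argmax(dim=-1)[:20])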
Same issue here, has anyone found a solution?
Hello, thank you so much for your great work! I am not sure if it is a bug: I converted train.ipynb to train.py and launched the training on Spanish. However, it restarts the training every 500 steps.