ethan-digi opened this issue 3 months ago
Have you tried lowering learning rate?
Yes, I have thought of lowering the learning rate 😆. I'm just looking for loss curves known to produce desirable results, so I can approach this from an angle other than 'down good', lol. But I do appreciate the input, and I expect the solution will involve LR adjustments to some degree.
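For what it's worth, "effectively flat" can be quantified before touching the LR. Here's a minimal sketch (plain Python; the window size and threshold are hypothetical values I picked, not anything from the repo) of a plateau check one could run over a logged per-epoch loss history:

```python
def is_plateaued(losses, window=5, rel_threshold=0.01):
    """Return True if the mean loss over the last `window` epochs improved
    by less than `rel_threshold` relative to the window before it."""
    if len(losses) < 2 * window:
        return False  # not enough history to judge
    prev = sum(losses[-2 * window:-window]) / window
    recent = sum(losses[-window:]) / window
    if prev == 0:
        return recent >= prev  # avoid dividing by a zero baseline
    return (prev - recent) / abs(prev) < rel_threshold

# A steadily decreasing curve is not a plateau...
print(is_plateaued([1.0 - 0.05 * i for i in range(20)]))  # False
# ...while a curve that stops moving is.
print(is_plateaued([1.0] * 10 + [0.5] * 13))              # True
```

Something like this makes "loss has stalled since epoch N" an objective statement when comparing runs, rather than an eyeball judgment on a chart.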
As stated in the title, I'm pre-training the model on a custom dataset, and am noticing that after epoch 10 of second-stage training (multi-speaker, so the LibriTTS epoch config), loss is not decreasing as I'd expect; it remains effectively flat. First-stage training went without any issues, as did the first 10 epochs of second-stage training.
For a lot of the losses, this makes sense: the vast majority of optimization happened prior to diffusion training (the first 10 epochs), so further gains probably won't be terribly perceptible. I also know that the generator and discriminator losses tend not to decrease, given the adversarial nature of their architecture.
However, neither Style nor Diffusion loss has decreased (see the final loss chart screenshot), despite neither being an adversarial loss and despite both being introduced only at epoch 10. I'm not sure whether this is expected behavior, but I suspect it isn't. When I run inference from model checkpoints, the fidelity is a bit below what I'd expect, which furthers my suspicion that training is not happening correctly.
I've attached losses and my config below:
Config is above. Highlights: batch size is 14 now; it was 21 for epochs 1–10. Yes, I have seven (7) GPUs. I don't think that has anything to do with this, but 14 and 21 would be odd batch sizes otherwise. LR is at the default values.
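(To spell out the arithmetic behind those seemingly odd numbers, assuming the global batch is split evenly across data-parallel replicas:)

```python
gpus = 7
for global_batch in (21, 14):
    per_gpu = global_batch // gpus
    print(f"global batch {global_batch} -> {per_gpu} samples per GPU")
# global batch 21 -> 3 samples per GPU
# global batch 14 -> 2 samples per GPU
```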
I've made very slight modifications to the data-loading code, but nothing that should impact performance. I also modified one line in the diffusion module in `sampler.py`, but it's just a NaN check on the config value:

```python
sigma_data = self.sigma_data if not torch.isnan(torch.tensor(self.sigma_data)) else 0.2
```
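As an aside: if `self.sigma_data` comes out of the config as a plain Python float (an assumption on my part), the same guard can be written without building a throwaway tensor. A sketch, with `safe_sigma_data` being a hypothetical helper name:

```python
import math

def safe_sigma_data(sigma_data, default=0.2):
    """Fall back to `default` when the configured sigma_data is NaN.
    Assumes sigma_data is a plain float, not a tensor."""
    return default if math.isnan(sigma_data) else sigma_data

print(safe_sigma_data(0.5))           # 0.5
print(safe_sigma_data(float("nan")))  # 0.2
```

Behavior is identical for float inputs; it just avoids the `torch.tensor(...)` round-trip on every call.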
Here's the config for the most recent epoch:
I would appreciate some insight, or if anyone who has successfully trained from scratch could post their loss curves/configs.