lingtengqiu closed this issue 1 year ago
Same issue.
Same issue. I tested the code on a V100 and it works well, but the loss becomes NaN on other devices (e.g., 30xx, A100).
Currently, I only have 3090 devices, so I cannot test it on a V100. What might be the reason for that? I am curious about what makes the code behave differently depending on the device.
I ran my code with IDMRFLoss excluded, and I saw no more NaN terms. But I am not sure whether training this way is advisable.
Also, I found that the optimization step becomes much faster with this temporary workaround.
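For anyone who wants to confirm which term is responsible before dropping it, here is a minimal sketch of the workaround above. It is not the repository's code: `total_loss`, the `l1` term, and the weights are hypothetical names for illustration, and it assumes each loss term has already been converted to a plain Python float.

```python
import math

def total_loss(terms, weights, disabled=()):
    """Sum weighted loss terms, skipping disabled ones.

    Raises FloatingPointError naming the first non-finite term, so a NaN
    can be traced to a specific loss instead of showing up in the total.
    All names here are hypothetical, not taken from the repository.
    """
    total = 0.0
    for name, value in terms.items():
        if name in disabled:
            continue  # e.g. leave out IDMRFLoss as a temporary workaround
        if not math.isfinite(value):
            raise FloatingPointError(f"loss term {name!r} is not finite: {value}")
        total += weights.get(name, 1.0) * value  # weight defaults to 1.0
    return total

# Example with made-up values: the NaN term is skipped when disabled.
terms = {"l1": 0.5, "IDMRFLoss": float("nan")}
print(total_loss(terms, {"l1": 2.0}, disabled=("IDMRFLoss",)))
```

Calling it without `disabled` on the same values raises an error that names IDMRFLoss, which is one way to check that this term, and not another one, is the source of the NaN.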
I asked the author, and she told me that you can try using a perceptual loss instead of the MRF loss (IDMRFLoss) on 3090/A100 devices. It may work.
Thanks for your great work.
When I train the hybrid stage, the losses become NaN, as the following graph illustrates.
Also, the optimization is very slow (200 steps take about 30 minutes).