The issue about training hybrid stage.

yfeng95 / SCARF

The issue about training hybrid stage. #4

Closed lingtengqiu closed 1 year ago

lingtengqiu commented 1 year ago

Thanks for your great work.

When I train the hybrid stage, the losses become NaN, as the following graph illustrates.

Also, optimization is very slow (about 30 minutes for 200 steps).

seungjun-moon commented 1 year ago

Same issue.

lingtengqiu commented 1 year ago

Same issue. I tested the code on a V100 and it works well, but the loss becomes NaN on other devices (e.g., 30xx, A100).

seungjun-moon commented 1 year ago

Currently, I only have 3090 devices, so I cannot test it on a V100. What might be the reason for that? I am curious about what makes the code behave differently depending on the device.

seungjun-moon commented 1 year ago

I ran my code with IDMRFLoss excluded and saw no more NaN terms. However, I am not sure whether training this way is advisable. I also found that the optimization steps become much faster with this temporary workaround.
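
For anyone who wants to reproduce this kind of check, here is a minimal sketch of summing weighted loss terms while skipping a disabled term and reporting which enabled terms are non-finite. It is not the SCARF training code: the function `combine_losses`, the term names (`rgb`, `mask`, `mrf`), and all values and weights are hypothetical, and the values are plain floats (in PyTorch they would typically come from `loss.item()`).

```python
import math

def combine_losses(terms, weights, disabled=()):
    """Sum weighted loss terms, skipping any term named in `disabled`.

    Returns (total, bad), where `bad` lists the enabled terms whose value
    is NaN or infinite. Non-finite terms are left out of `total`.
    """
    total = 0.0
    bad = []
    for name, value in terms.items():
        if name in disabled:
            continue  # term switched off, e.g. the MRF loss
        if not math.isfinite(value):
            bad.append(name)  # remember which term blew up
            continue
        total += weights.get(name, 1.0) * value
    return total, bad

# Hypothetical per-step values; the "mrf" entry stands in for IDMRFLoss.
terms = {"rgb": 0.12, "mask": 0.03, "mrf": float("nan")}
weights = {"rgb": 1.0, "mask": 0.5, "mrf": 0.0005}

# With every term enabled, the non-finite term is reported by name.
total, bad = combine_losses(terms, weights)
print(total, bad)

# With the MRF term disabled, no non-finite term remains.
total_no_mrf, bad_no_mrf = combine_losses(terms, weights, disabled={"mrf"})
print(total_no_mrf, bad_no_mrf)
```

Logging `bad` at every step makes it easier to confirm that a single term, rather than the whole objective, is the source of the NaN.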

lingtengqiu commented 1 year ago

I asked the author, and she told me you can try using a perceptual loss instead of the MRF loss (IDMRFLoss) on 3090/A100 devices. It may work.
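
A rough sketch of how that suggestion could be wired into a training script: pick the loss type from the GPU name, keeping the MRF loss only on the V100, where it was reported to work in this thread. The function `pick_texture_loss` and the returned labels are hypothetical and not part of SCARF. The device name is assumed to be a string such as the one returned by `torch.cuda.get_device_name(0)`, and the example names below are illustrative.

```python
def pick_texture_loss(device_name):
    """Return 'mrf' on a V100 and 'perceptual' on any other device.

    Based only on the reports in this thread: the MRF loss trained fine
    on a V100 but produced NaN on 30xx and A100 GPUs.
    """
    if "V100" in device_name.upper():
        return "mrf"
    # Unknown devices default to the perceptual loss, the safer choice here.
    return "perceptual"

# Illustrative device-name strings (assumed, not taken from the thread).
for name in ["Tesla V100-SXM2-16GB", "NVIDIA GeForce RTX 3090", "NVIDIA A100-SXM4-40GB"]:
    print(name, "->", pick_texture_loss(name))
```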