yu4u / noise2noise

An unofficial and partial Keras implementation of "Noise2Noise: Learning Image Restoration without Clean Data"
MIT License
1.08k stars 234 forks

loss nan or inf #60

Open HongChow opened 1 year ago

HongChow commented 1 year ago

Training data: https://cv.snu.ac.kr/research/VDSR/train_data.zip
Environment: keras==2.1.6 and tensorflow-gpu==1.5.0
The loss becomes nan or inf after the first step or two, for example:

1/1000 [..............................] - ETA: 48:13 - loss: 33339.6055 - PSNR: 5.4144
3/1000 [..............................] - ETA: 16:30 - loss: nan - PSNR: 4.5721
5/1000 [..............................] - ETA: 10:10 - loss: nan - PSNR: 4.0657
7/1000 [..............................] - ETA: 7:27 - loss: nan - PSNR: 3.8570
9/1000 [..............................] - ETA: 5:56 - loss: nan - PSNR: 3.9201
11/1000 [..............................] - ETA: 4:58 - loss: nan - PSNR: 3.9512

@yu4u could you please help with this?

HongChow commented 1 year ago

The loss at the first step, and sometimes at the third step, is valid, but every later step gives inf or nan.
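To pin down exactly which step first goes non-finite, the progress output can be scanned programmatically. The following is only a minimal sketch, assuming the training output has been captured as text in the Keras progress-bar format shown in the first comment; the function name `first_non_finite_step` is hypothetical and not part of this repository.

```python
import math
import re

# Matches Keras progress-bar fragments such as
# "3/1000 [......] - ETA: 16:30 - loss: nan - PSNR: 4.5721".
# Assumption: each fragment contains "<step>/<total> [bar] ... loss: <value>".
_STEP_RE = re.compile(r"(\d+)/\d+\s+\[[^\]]*\].*?loss:\s*(\S+)")


def first_non_finite_step(log_text):
    """Return the first step whose reported loss is nan or inf, or None."""
    for match in _STEP_RE.finditer(log_text):
        step = int(match.group(1))
        loss = float(match.group(2))  # float() accepts "nan" and "inf"
        if not math.isfinite(loss):
            return step
    return None
```

Run on the output quoted in the first comment, this would report step 3, the first step at which the loss is printed as nan.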

HongChow commented 1 year ago

If I use the CPU build of TensorFlow, it works fine.

tangrc commented 1 year ago

@HongChow

I also encountered a similar problem. Have you found a solution?

HongChow commented 1 year ago

@tangrc Hi, it seems that upgrading to the latest PyTorch fixes this problem.

tangrc commented 1 year ago

@HongChow thanks!

tangrc commented 12 months ago

@HongChow Hi, I have a question about your reply. This project is based on TensorFlow, so why do you say that upgrading to the latest PyTorch fixes it? Thanks again! When I run it, the loss is very large and the val loss is nan.

HongChow commented 12 months ago

Oh, I am very sorry about this; I made a mistake. That was another repo, one that uses PyTorch. I upgraded PyTorch to the latest version there, and that fixed a similar problem.