xuebinqin / BASNet

Code for the CVPR 2019 paper "BASNet: Boundary-Aware Salient Object Detection"
MIT License

Hello, when training on my own dataset, why do the d0 and d1 outputs become NaN late in training? #47

Open yihong-97 opened 3 years ago

yihong-97 commented 3 years ago

Hello, when I train on my own dataset, at around iteration 156K the d0 and d1 outputs become NaN, which makes the l0 and l1 losses NaN:

```
[epoch: 308/100000, batch: 2456/ 4085, ite: 156707] train loss: 2.115248, tar: 0.097755 l0: 0.090264, l1: 0.090268, l2: 0.094700, l3: 0.108498, l4: 0.157684, l5: 0.269343, l6: 0.561054
[epoch: 308/100000, batch: 2464/ 4085, ite: 156708] train loss: 2.114669, tar: 0.097731 l0: 0.110660, l1: 0.110660, l2: 0.116909, l3: 0.147194, l4: 0.230913, l5: 0.414125, l6: 0.675684
[epoch: 308/100000, batch: 2472/ 4085, ite: 156709] train loss: 2.115880, tar: 0.097773 l0: 0.101519, l1: 0.101512, l2: 0.107206, l3: 0.128373, l4: 0.198813, l5: 0.377140, l6: 0.674387
[epoch: 308/100000, batch: 2480/ 4085, ite: 156710] train loss: 2.116687, tar: 0.097785 l0: 0.092943, l1: 0.092937, l2: 0.097863, l3: 0.117802, l4: 0.182888, l5: 0.299898, l6: 0.505494
[epoch: 308/100000, batch: 2488/ 4085, ite: 156711] train loss: 2.115991, tar: 0.097769 l0: 0.104595, l1: 0.104529, l2: 0.109673, l3: 0.131785, l4: 0.201885, l5: 0.407138, l6: 0.842563
[epoch: 308/100000, batch: 2496/ 4085, ite: 156712] train loss: 2.118025, tar: 0.097791 l0: nan, l1: nan, l2: 2.413359, l3: 2.419617, l4: 2.441422, l5: 2.419549, l6: 2.403301
[epoch: 308/100000, batch: 2504/ 4085, ite: 156713] train loss: nan, tar: nan l0: nan, l1: nan, l2: 2.489194, l3: 2.498003, l4: 2.527905, l5: 2.497976, l6: 2.474765
```

xuebinqin commented 3 years ago

We use `return F.sigmoid(d0)` in the network definition, which may not be numerically stable in some cases. You can try returning only `d0` (the raw logits) and replacing the current BCE loss with `BCEWithLogitsLoss`; that may resolve the issue. It is also worth checking that your inputs are all valid.
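A minimal sketch of that change (the function name and the seven outputs follow the training log above; treat them as assumptions rather than the exact repo code):

```python
import torch.nn as nn

# In the model's forward(), return the raw logits d0..d6 instead of
# F.sigmoid(d0)..F.sigmoid(d6). BCEWithLogitsLoss applies the sigmoid
# internally via the log-sum-exp trick, so a saturated logit no longer
# produces log(0) = -inf and then NaN.
bce_loss = nn.BCEWithLogitsLoss(reduction='mean')

def multi_bce_loss_fusion(outputs, labels_v):
    """outputs: tuple of raw logit maps (d0, d1, ..., d6); labels_v in [0, 1]."""
    losses = [bce_loss(d, labels_v) for d in outputs]
    return losses[0], sum(losses)  # (loss on d0, total training loss)
```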


yihong-97 commented 3 years ago


After modifying the network to return `d0` directly and switching to `BCEWithLogitsLoss`, the NaN still appears after the same number of steps. The inputs are valid, and it is worth noting that only d0 and d1 become NaN; the other outputs remain normal.
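To localize where the first non-finite value appears, a generic PyTorch debugging sketch (not code from this repo; `check_outputs` is a hypothetical helper) can be dropped into the training loop:

```python
import torch

# Make autograd raise on the exact backward op that produced a NaN/Inf
# (slows training; enable only while debugging).
torch.autograd.set_detect_anomaly(True)

def check_outputs(named_outputs):
    """named_outputs: dict such as {'d0': d0, 'd1': d1, ...} of network outputs."""
    for name, t in named_outputs.items():
        if not torch.isfinite(t).all():
            raise RuntimeError(f'non-finite values in {name}: '
                               f'min={t.min().item()}, max={t.max().item()}')
```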

xuebinqin commented 3 years ago

There are several other options you can try, for example: (1) add `torch.nn.utils.clip_grad_norm` just after `loss.backward()`; (2) change the dataloader to normalize your input image as `image = (image - image.min() + 1e-8) / (image.max() - image.min() + 1e-8)`; etc.
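Both suggestions sketched together (a sketch only; `net`, `optimizer`, `loss_fn`, and the `max_norm` value are assumptions to adapt to the actual training script):

```python
import torch

def train_step(net, optimizer, inputs, labels, loss_fn):
    """One training step with gradient clipping; all argument names are assumed."""
    optimizer.zero_grad()
    outputs = net(inputs)
    loss0, loss = loss_fn(outputs, labels)
    loss.backward()
    # (1) Clip gradients right after backward() and before step();
    # clip_grad_norm_ is the current in-place variant of clip_grad_norm.
    torch.nn.utils.clip_grad_norm_(net.parameters(), max_norm=1.0)
    optimizer.step()
    return loss0.item(), loss.item()

def normalize_image(image: torch.Tensor) -> torch.Tensor:
    # (2) Min-max normalization in the dataloader; the 1e-8 guards against
    # division by zero on constant (e.g. all-black) images.
    return (image - image.min() + 1e-8) / (image.max() - image.min() + 1e-8)
```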


yihong-97 commented 3 years ago


Thank you very much. I'll try these options.