yingkunwu / R-YOLOv4

This is a PyTorch-based R-YOLOv4 implementation that combines the YOLOv4 model with the loss function from R3Det for arbitrary-oriented object detection.

Training on my own dataset: an error occurs partway through the epochs #22

Closed yuxin7 closed 1 year ago

yuxin7 commented 2 years ago

---- [Epoch 12/100] ----
+------------------+--------------------+--------------------+---------------------+---------------------+
| Step: 4563/38900 | loss               | reg_loss           | conf_loss           | cls_loss            |
+------------------+--------------------+--------------------+---------------------+---------------------+
| YoloLayer1       | 0.5429285764694214 | 0.3865797519683838 | 0.11499390006065369 | 0.04135490208864212 |
| YoloLayer2       | 0.7612576484680176 | 0.4948746860027313 | 0.16640125215053558 | 0.09998173266649246 |
| YoloLayer3       | 1.0451996326446533 | 0.6731263399124146 | 0.20551152527332306 | 0.16656182706356049 |
+------------------+--------------------+--------------------+---------------------+---------------------+
Total Loss: 2.349386, Runtime: 6990.246672
C:/w/b/windows/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: block: [0,0,0], thread: [25,0,0] Assertion input_val >= zero && input_val <= one failed.
C:/w/b/windows/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: block: [0,0,0], thread: [26,0,0] Assertion input_val >= zero && input_val <= one failed.
C:/w/b/windows/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: block: [0,0,0], thread: [27,0,0] Assertion input_val >= zero && input_val <= one failed.
C:/w/b/windows/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: block: [0,0,0], thread: [28,0,0] Assertion input_val >= zero && input_val <= one failed.
C:/w/b/windows/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: block: [0,0,0], thread: [29,0,0] Assertion input_val >= zero && input_val <= one failed.
C:/w/b/windows/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: block: [0,0,0], thread: [30,0,0] Assertion input_val >= zero && input_val <= one failed.
C:/w/b/windows/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: block: [0,0,0], thread: [31,0,0] Assertion input_val >= zero && input_val <= one failed.
Traceback (most recent call last):
  File "D:/WorkSpace/PythonWorkSpace/R-YOLOv4/train.py", line 174, in <module>
    t.train()
  File "D:/WorkSpace/PythonWorkSpace/R-YOLOv4/train.py", line 151, in train
    outputs, loss = self.model(imgs, targets)
  File "D:\Software\Anconda\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "D:\WorkSpace\PythonWorkSpace\R-YOLOv4\model\yolo.py", line 35, in forward
    y1, loss1 = self.yolo1(x2, target)
  File "D:\Software\Anconda\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "D:\WorkSpace\PythonWorkSpace\R-YOLOv4\model\yololayer.py", line 202, in forward
    cls_loss += F.binary_cross_entropy(pred_cls[obj_mask], tcls[obj_mask], reduction=self.reduction)
RuntimeError: CUDA error: device-side assert triggered
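The assertion in the log says that every value passed to binary cross-entropy must lie in [0, 1]. A NaN fails that check too, because any comparison with NaN is false. The following is a minimal pure-Python sketch of the same check, not code from the repository; the helper names `find_invalid_probabilities` and `binary_cross_entropy` are made up for illustration. In the training code, an equivalent check could be applied to `pred_cls[obj_mask]` just before the loss call.

```python
import math


def find_invalid_probabilities(values):
    """Return (index, value) pairs that a BCE input-range check would reject."""
    bad = []
    for i, v in enumerate(values):
        # NaN fails every comparison, so it is rejected just like 1.5 or -0.1.
        if not (0.0 <= v <= 1.0):
            bad.append((i, v))
    return bad


def binary_cross_entropy(pred, target):
    """Mean binary cross-entropy with an explicit range check on the predictions."""
    if not pred:
        raise ValueError("pred must not be empty")
    bad = find_invalid_probabilities(pred)
    if bad:
        raise ValueError(f"predictions outside [0, 1]: {bad}")
    eps = 1e-12  # keeps log() finite when a prediction is exactly 0 or 1
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(pred)
```

If the check reports NaN values, the problem probably started earlier in the forward pass or in the targets, and the loss call is only where it surfaced.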

yingkunwu commented 2 years ago

Hello, could I take a look at your dataset? A few samples would be enough, thanks!

yuxin7 commented 2 years ago

This is the data I used. Please take a look, and thank you for your reply. You can also add me on WeChat via the QQ number 2252685386.


yingkunwu commented 2 years ago

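yuxin7 wrote: "This is the data I used. Please take a look, and thank you for your reply. You can also add me on WeChat via the QQ number 2252685386."

I can't seem to find the data you mentioned. How did you share it?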

yuxin7 commented 2 years ago

It's in the attachment.


huansu commented 2 years ago

@yuxin7 I'm running into the same problem. Did you find the cause and fix it? Thanks.

huansu commented 2 years ago

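Perhaps the data in one of the txt files is malformed. I trimmed part of the dataset and training then ran successfully, possibly because that happened to remove the bad data.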
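One way to test the bad-data hypothesis is to scan the label files instead of trimming the dataset by trial and error. The sketch below assumes a hypothetical label format of six whitespace-separated fields per line (`class_id x_center y_center width height angle`, with the center and size normalized to the image). That format is an assumption for illustration, not the repository's documented layout, so the field count and range checks would need adjusting to the real files. The function name `find_bad_label_lines` is also made up.

```python
import math


def find_bad_label_lines(lines, num_classes):
    """Return (line_number, reason) for each line that looks malformed.

    Assumed (hypothetical) format per line:
        class_id x_center y_center width height angle
    """
    problems = []
    for n, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if len(fields) != 6:
            problems.append((n, "expected 6 fields"))
            continue
        try:
            cls = int(fields[0])
            x, y, w, h, angle = (float(f) for f in fields[1:])
        except ValueError:
            problems.append((n, "non-numeric field"))
            continue
        if not all(math.isfinite(v) for v in (x, y, w, h, angle)):
            problems.append((n, "NaN or infinite value"))
        elif not 0 <= cls < num_classes:
            problems.append((n, "class id out of range"))
        elif not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
            problems.append((n, "center outside the image"))
        elif w <= 0.0 or h <= 0.0:
            problems.append((n, "non-positive width or height"))
    return problems
```

Running it over every label file, for example with `open(path).read().splitlines()`, would list the suspicious files and line numbers.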

huansu commented 2 years ago

@kunnnnethan Could you share your pytorch and torchvision versions? Thanks. The problem is still not resolved.

yingkunwu commented 2 years ago

@huansu Hello, and thanks for the report. Does the error occur when training with mosaic=True or with mosaic=False?

huansu commented 2 years ago

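@kunnnnethan Thanks for the quick reply. I tried both mosaic=True and mosaic=False, and the error occurs in both cases. After repeatedly varying the amount of input data, I found that the error is random: with the same data, the first run may fail while the second does not.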

yingkunwu commented 2 years ago

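@huansu My current guess is that, for some unknown reason, an object's center point ends up outside the boundary during training. I added three lines of code in yololayer (Link), which should prevent the error from happening again. Could you pull the change and try again? Thanks!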
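The linked three-line change is not reproduced in this thread, so the following is only a sketch of the general idea, under the assumption that the fix keeps grid-cell indices inside the feature map. The function name `cell_index` and the normalized-coordinate convention are illustrative, not taken from the repository.

```python
def cell_index(center, grid_size):
    """Map a normalized center coordinate to a grid cell index, clamped to the grid.

    A center of exactly 1.0 (or slightly beyond, e.g. after augmentation)
    would otherwise give an index equal to grid_size, which is out of range.
    """
    idx = int(center * grid_size)  # truncates toward zero
    return min(max(idx, 0), grid_size - 1)
```

For a 13x13 grid, `cell_index(1.0, 13)` gives 12 instead of the out-of-range 13, and a negative center is mapped to cell 0.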

huansu commented 2 years ago

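@kunnnnethan Thanks. I tried again and stepped through it in a debugger. The problem seems to be in the cross-entropy function: as far as I understand the code logic, I don't see anything wrong in the explicit program flow. My pytorch version is 1.11.0+cu113. Could you tell me which version you use? I'll switch to the same one and try.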
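When debugging a device-side assert, it helps to know that CUDA kernels run asynchronously, so the Python traceback may point at a later line than the one that actually failed. Setting the `CUDA_LAUNCH_BLOCKING` environment variable makes kernel launches synchronous so the traceback lines up with the failing call. A minimal sketch, assuming it is placed at the very top of the training script:

```python
import os

# Must be set before CUDA is initialized, so do it before importing torch.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import torch only after the variable is set
```

Running a few batches on the CPU is another option, since CPU errors are raised synchronously and usually carry a more descriptive message.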

yingkunwu commented 2 years ago

@huansu Are you using an anaconda environment?

huansu commented 2 years ago

@kunnnnethan python-pip

yingkunwu commented 2 years ago

@huansu What is your python version?

huansu commented 2 years ago

@kunnnnethan python3.8

yingkunwu commented 2 years ago

@huansu The pytorch version shouldn't matter. Try cuda10.2.

huansu commented 2 years ago

@kunnnnethan OK, thanks.