zylo117 / Yet-Another-EfficientDet-Pytorch

The pytorch re-implement of the official efficientdet with SOTA performance in real time and pretrained weights.
GNU Lesser General Public License v3.0

RuntimeError: CUDA error: device-side assert triggered #641

Closed Ashutosh1995 closed 3 years ago

Ashutosh1995 commented 3 years ago

@zylo117 I am training EfficientDet-D0 on my custom dataset, having resolved the issue I had earlier.

I am running the following command:

python train.py -c 0 -p indian_road --batch_size 4 --lr 1e-3 --num_epochs 15 --load_weights weights/efficientdet-d0.pth

During the final stage of training, I get the following error:

image

Kindly help!

nTnZone commented 3 years ago

I used my own custom dataset

I figured it out by checking the category items.

Ashutosh1995 commented 3 years ago

@nTnZone were you able to solve it?

Ashutosh1995 commented 3 years ago

@zylo117 in the collater function, you pad the annotation matrix with extra rows filled with -1.

This means the labels also get -1 values, right? Could that lead to this CUDA device-side assert?

Please suggest!
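To make the question concrete, here is a minimal sketch of the padding idea: per-image annotation lists are padded to a common length with rows of -1, and those rows are dropped again before any label is used as a class index. Plain Python lists stand in for tensors, and `pad_annotations` and `valid_rows` are hypothetical helper names, not functions from the repository. As long as padded rows are masked out like this, they should not reach an indexing operation; a label outside 0..num_classes-1 that does reach one is a common cause of a device-side assert.

```python
# Row layout assumed for illustration: x1, y1, x2, y2, label
PAD_ROW = [-1.0, -1.0, -1.0, -1.0, -1.0]


def pad_annotations(batch):
    """Pad each image's annotation rows to the longest list in the batch."""
    max_len = max((len(rows) for rows in batch), default=0)
    return [
        rows + [list(PAD_ROW) for _ in range(max_len - len(rows))]
        for rows in batch
    ]


def valid_rows(rows):
    """Keep only real annotations, i.e. rows whose label is not -1."""
    return [row for row in rows if row[4] != -1]


# Two images: the first has two boxes, the second has one.
batch = [
    [[10.0, 20.0, 50.0, 80.0, 0.0], [5.0, 5.0, 30.0, 40.0, 3.0]],
    [[0.0, 0.0, 15.0, 15.0, 1.0]],
]
padded = pad_annotations(batch)
```

After padding, every image has the same number of rows, and `valid_rows(padded[1])` recovers the single real annotation of the second image.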

zylo117 commented 3 years ago

category id starts from 1

Ashutosh1995 commented 3 years ago

@zylo117 in my VOC-to-COCO conversion file, I assigned category IDs starting from 1, as below:

PRE_DEFINE_CATEGORIES = {"auto": 1, "bicycle": 2, "bus": 3, "biker": 4, "car":5, "cow": 6, "cyclist": 7, "dog": 8, "motorbike": 9, "minitruck": 10, "person": 11, "truck": 12, "van": 13, "tractor": 14, "trolley": 15}

Also, in the project.yml file, I defined obj_list as ["auto","bicycle","bus","biker","car","cow","cyclist","dog","motorbike", "minitruck","person","truck","van","tractor","trolley"], i.e. in the same order as above.

Is there something else I should do? The training code always stops during validation, when the images are transferred to CUDA, printing the same error: CUDA error: device-side assert triggered
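One way to double-check the setup described above is to validate the annotation JSON against obj_list directly. The sketch below assumes a COCO-style dict with "categories" and "annotations" keys, and assumes category IDs are expected to run from 1 to len(obj_list) in obj_list order, as suggested earlier in the thread. `check_category_ids` is a hypothetical helper, not part of the repository. Since the failure shows up during validation, it may be worth running it on both the train and val annotation files.

```python
def check_category_ids(coco, obj_list):
    """Return a list of human-readable problems found in a COCO-style dict."""
    problems = []
    n = len(obj_list)

    # Every declared category should have an id in 1..n matching obj_list order.
    for cat in sorted(coco.get("categories", []), key=lambda c: c["id"]):
        cat_id, name = cat["id"], cat["name"]
        if not 1 <= cat_id <= n:
            problems.append(f"category {name!r} has id {cat_id}, outside 1..{n}")
        elif obj_list[cat_id - 1] != name:
            problems.append(
                f"id {cat_id} is {name!r} in the JSON "
                f"but {obj_list[cat_id - 1]!r} in obj_list"
            )

    # Every annotation should reference an id in 1..n.
    for ann in coco.get("annotations", []):
        if not 1 <= ann["category_id"] <= n:
            problems.append(
                f"annotation {ann.get('id')} uses category_id {ann['category_id']}"
            )
    return problems


# Tiny in-memory example; a real file would be loaded with json.load().
obj_list = ["auto", "bicycle", "bus"]
coco = {
    "categories": [
        {"id": 1, "name": "auto"},
        {"id": 2, "name": "bicycle"},
        {"id": 3, "name": "bus"},
    ],
    "annotations": [
        {"id": 100, "category_id": 2},
        {"id": 101, "category_id": 0},  # deliberately out of range
    ],
}
problems = check_category_ids(coco, obj_list)
```

Here `problems` contains a single entry, for annotation 101; an empty list means the IDs and obj_list agree under the stated assumptions.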

zylo117 commented 3 years ago

It's hard to tell without any details

Ashutosh1995 commented 3 years ago

@zylo117 Could you please tell me what details you want so that the issue can be fixed?

zylo117 commented 3 years ago

what's the error? logs?
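For collecting more useful logs: CUDA kernels run asynchronously, so the Python stack trace for a device-side assert often points at a later, unrelated line. Setting the environment variable CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, so the trace points at the operation that actually failed. A minimal sketch follows, reusing the command from the first post; `blocking_env` and `run_training` are hypothetical helper names, and the script is assumed to be run from the repository root.

```python
import os
import subprocess
import sys


def blocking_env(base=None):
    """Copy an environment mapping and make CUDA kernel launches synchronous."""
    env = dict(os.environ if base is None else base)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


# Training command taken from the first post in this thread.
TRAIN_CMD = [
    sys.executable, "train.py", "-c", "0", "-p", "indian_road",
    "--batch_size", "4", "--lr", "1e-3", "--num_epochs", "15",
    "--load_weights", "weights/efficientdet-d0.pth",
]


def run_training():
    """Launch training with synchronous CUDA launches; return the exit code."""
    return subprocess.run(TRAIN_CMD, env=blocking_env()).returncode
```

Calling `run_training()` starts the same training run as before, only slower, and the resulting traceback is the one worth posting.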

Ashutosh1995 commented 3 years ago

In the training loop, I get the following error:

image

When the validation loop begins, the code breaks with the following error:

image

zylo117 commented 3 years ago

image

So you can still manage to train for a few steps? Could it be OOM? You should monitor VRAM usage with nvidia-smi.

Ashutosh1995 commented 3 years ago

image

In the same epoch, when the validation stage starts, it triggers the warning and then quits.

Is the value shown in nvidia-smi the VRAM usage?

If not, could you please give a pointer on how to calculate VRAM usage?
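On the VRAM question: the Memory-Usage column in the nvidia-smi table shows used and total GPU memory per device, which is the VRAM figure being asked about. For logging it during a run, nvidia-smi can also print just those numbers in CSV form. The sketch below wraps that query in Python; `parse_memory` and `gpu_memory` are hypothetical helper names, and `gpu_memory` only works on a machine with an NVIDIA driver installed.

```python
import subprocess

# Ask nvidia-smi for used and total memory in MiB, one CSV line per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory(text):
    """Parse 'used, total' lines into a list of (used_mib, total_mib) pairs."""
    pairs = []
    for line in text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        pairs.append((used, total))
    return pairs


def gpu_memory():
    """Query nvidia-smi once and return one (used, total) pair per GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory(result.stdout)
```

Calling `gpu_memory()` periodically while training runs shows whether usage climbs toward the total just before the crash, which would point to an out-of-memory condition rather than a bad label.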

zylo117 commented 3 years ago

Validation? But there are a few thousand steps remaining.

image

Did you modify the code? Can you run the tutorials?

Ashutosh1995 commented 3 years ago

Actually, training completes; the error I pasted earlier appears only as a warning.

It's when the validation phase starts that the code breaks and outputs RuntimeError: CUDA error: device-side assert triggered

I did not modify the code

I will also run the tutorials. I ran the test code and it ran perfectly.

Ashutosh1995 commented 3 years ago

It got resolved. Thanks!