zihangJiang / TokenLabeling

PyTorch implementation of "All Tokens Matter: Token Labeling for Training Better Vision Transformers"
Apache License 2.0
425 stars, 36 forks

RuntimeError: CUDA error: device-side assert triggered #16

Closed. JIAOJIAYUASD closed this issue 2 years ago.

JIAOJIAYUASD commented 3 years ago

I am new to deep learning. When I run the VOLO code with tlt on a single GPU or on multiple GPUs, I get the following error:

/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [25,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
Traceback (most recent call last):
  File "main.py", line 949, in <module>
    main()
  File "main.py", line 664, in main
    optimizers=optimizers)
  File "main.py", line 773, in train_one_epoch
    label_size=args.token_label_size)
  File "/opt/conda/lib/python3.6/site-packages/tlt/data/mixup.py", line 90, in mixup_target
    y1 = get_labelmaps_with_coords(target, num_classes, on_value=on_value, off_value=off_value, device=device, label_size=label_size)
  File "/opt/conda/lib/python3.6/site-packages/tlt/data/mixup.py", line 64, in get_labelmaps_with_coords
    num_classes=num_classes,device=device)
  File "/opt/conda/lib/python3.6/site-packages/tlt/data/mixup.py", line 16, in get_featuremaps
    _label_topk[1][:, :, :].long(),
RuntimeError: CUDA error: device-side assert triggered

I can't fix this problem right now.

JIAOJIAYUASD commented 3 years ago

The VOLO code: https://github.com/sail-sg/volo

zihangJiang commented 3 years ago

Can you list the command line or script you used for training? Or you can first try with the instructions provided here in this repo to see if the error still exists.

JIAOJIAYUASD commented 3 years ago

sh distributed_train.sh 16 /userhome/data/imagenet --model volo_d1 --img-size 224 -b 128 --lr 5.5e-4 --dr 5e-4 --drop-path 0.1 --token-label --token-label-size 14 --token-label-data /userhome/deit_test1/jiaojiayu/label_top5_train_nfnet

The distributed_train.sh script is:

NUM_PROC=$1
shift
python3 -m torch.distributed.launch --nproc_per_node=$NUM_PROC main.py "$@"

zihangJiang commented 3 years ago

The error seems to be caused by a mismatch in num_classes. Check that your ImageNet dataset folder is organized as described here.
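One way to test this diagnosis before launching training is sketched below. It is not part of tlt or VOLO: `count_class_dirs` and `out_of_range_labels` are hypothetical helpers, the `train` subfolder is an assumed ImageFolder-style layout (one directory per class), and 1000 is the class count assumed for ImageNet-1k. The first helper counts class folders. The second flags any label index outside [0, num_classes), which is the condition the "index out of bounds" assertion in the scatter kernel complains about.

```python
import os


def count_class_dirs(train_dir):
    """Count immediate subdirectories of train_dir.

    Assumes an ImageFolder-style layout with one subdirectory per class.
    """
    return sum(1 for entry in os.scandir(train_dir) if entry.is_dir())


def out_of_range_labels(indices, num_classes):
    """Return the label indices that fall outside [0, num_classes)."""
    return [i for i in indices if not 0 <= i < num_classes]


if __name__ == "__main__":
    # Assumption: the training split lives in <data root>/train.
    train_dir = "/userhome/data/imagenet/train"
    num_classes = 1000  # assumed value for ImageNet-1k

    if os.path.isdir(train_dir):
        found = count_class_dirs(train_dir)
        if found != num_classes:
            print(f"Expected {num_classes} class folders, found {found}")
    else:
        print(f"{train_dir} is not a directory; check the dataset path")

    # Illustrative indices only; in practice these would come from the
    # top-k class indices stored in the token label data.
    print(out_of_range_labels([0, 17, 999, 1000], num_classes))
```

If the folder count differs from the expected class count, or any stored label index is flagged, that would be consistent with the num_classes mismatch suggested above.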