[Closed] JIAOJIAYUASD closed this issue 2 years ago
The VOLO code: https://github.com/sail-sg/volo
Can you list the command line or script you used for training? Alternatively, you can first try the instructions provided in this repo to see if the error still occurs.
```sh
sh distributed_train.sh 16 /userhome/data/imagenet --model volo_d1 --img-size 224 -b 128 --lr 5.5e-4 --dr 5e-4 --drop-path 0.1 --token-label --token-label-size 14 --token-label-data /userhome/deit_test1/jiaojiayu/label_top5_train_nfnet
```
The `distributed_train.sh` script is:

```sh
NUM_PROC=$1
shift
python3 -m torch.distributed.launch --nproc_per_node=$NUM_PROC main.py "$@"
```
The error seems to be caused by a mismatch of `num_classes`. You can check your ImageNet dataset folder to see if it is properly organized as stated here.
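A class index that is negative or not smaller than `num_classes` is exactly what trips the gather/scatter "index out of bounds" assert below. A minimal sanity check along these lines (a sketch, not a helper from this repo; the function name is made up for illustration) can be run on the token-label data before training:

```python
import torch

def check_label_range(labels: torch.Tensor, num_classes: int) -> bool:
    """Return True only if every class index is a valid gather/scatter
    index into a tensor with `num_classes` entries."""
    return bool((labels >= 0).all() and (labels < num_classes).all())

# Top-k class indices in ImageNet-1k style (valid range: 0..999).
labels = torch.tensor([[1, 5, 999], [0, 3, 42]])

print(check_label_range(labels, 1000))  # True: all indices fit
print(check_label_range(labels, 100))   # False: 999 is out of range
```

If this returns False for your label maps (or for the dataset's folder-derived targets), the stored indices and the model's `num_classes` disagree.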
I am a beginner in DL. When I run the VOLO code with TLT on a single GPU or on multiple GPUs, I get the following error:

```
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [25,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
Traceback (most recent call last):
  File "main.py", line 949, in <module>
    main()
  File "main.py", line 664, in main
    optimizers=optimizers)
  File "main.py", line 773, in train_one_epoch
    label_size=args.token_label_size)
  File "/opt/conda/lib/python3.6/site-packages/tlt/data/mixup.py", line 90, in mixup_target
    y1 = get_labelmaps_with_coords(target, num_classes, on_value=on_value, off_value=off_value, device=device, label_size=label_size)
  File "/opt/conda/lib/python3.6/site-packages/tlt/data/mixup.py", line 64, in get_labelmaps_with_coords
    num_classes=num_classes, device=device)
  File "/opt/conda/lib/python3.6/site-packages/tlt/data/mixup.py", line 16, in get_featuremaps
    _label_topk[1][:, :, :].long(),
RuntimeError: CUDA error: device-side assert triggered
```

I can't fix this problem right now.
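Note that device-side asserts are reported asynchronously on CUDA, so the Python frame in the traceback is not always the true culprit. One common way to localize this class of error (a general debugging tactic, not something specific to this repo) is to run the same indexing op on CPU, where an out-of-range index raises a synchronous, readable Python error instead of a device-side assert:

```python
import torch

# Toy stand-in for a per-token label map: batch 2, 3 tokens, 10 classes.
src = torch.zeros(2, 3, 10)
idx = torch.tensor([[[2], [11], [0]],   # 11 is out of bounds for size-10 dim
                    [[1], [4],  [9]]])

try:
    # On CPU this raises RuntimeError immediately; the same call on CUDA
    # triggers the "index out of bounds" device-side assert shown above.
    torch.gather(src, 2, idx)
except RuntimeError as e:
    print("caught:", e)
```

Running with the environment variable `CUDA_LAUNCH_BLOCKING=1` also makes CUDA kernels launch synchronously, so the traceback points at the kernel that actually failed.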