Closed: Williamlizl closed this issue 3 years ago.
Please use torch.load().
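For reference, here is a minimal sketch of inspecting the checkpoint with torch.load(); the path is a placeholder, and the assumption that the file holds a dict is worth verifying on your copy:

import torch

# The .pth.tar file is a regular PyTorch checkpoint despite its extension,
# so no tar extraction is needed; map_location='cpu' avoids requiring a GPU
# just to inspect the file.
checkpoint = torch.load('/path/to/lvvit_s-26M-384-84-4.pth.tar', map_location='cpu')

# timm-style training scripts usually save a dict of weights plus metadata;
# print the keys to see what is inside (the exact keys are an assumption).
if isinstance(checkpoint, dict):
    print(checkpoint.keys())
else:
    print(type(checkpoint))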
Just replace the path, i.e. /path/to/lvvit_s-26M-384-84-4.pth.tar?
Yes, please run the following command to evaluate the model you mentioned:
python3 validate.py /path/to/imagenet/val --model lvvit_s --checkpoint /path/to/lvvit_s-26M-384-84-4.pth.tar --no-test-pool --amp --img-size 384 -b 64
For more examples, you can refer to https://github.com/zihangJiang/TokenLabeling/blob/main/eval.sh .
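(A note on the flags, assuming this validate.py follows timm's validation script as the repo appears to: --amp enables mixed-precision inference, --no-test-pool disables test-time pooling, -b 64 sets the batch size, and --img-size 384 must match the resolution the checkpoint was trained at, which the 384 in the file name indicates.)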
Yes, but I want to fine-tune the model on my own dataset:
~/TokenLabeling/pretrained$ CUDA_VISIBLE_DEVICES=1,2 distributed_train.sh 2 ./dataset/DRiD --model lvvit_s -b 64 --apex-amp --img-size 224 --drop-path 0.1 --token-label --token-label-size 14 --dense-weight 0.0 --num-classes 2 --finetune ./pretrained/lvvit_s-26M-384-84-4.pth.tar
distributed_train.sh: command not found
Is this right?
Yes! I will add more instructions to the readme in future updates.
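(Side note on the "distributed_train.sh: command not found" error above: the current directory is usually not on the shell's PATH, so the script must be invoked with an explicit path, as the later commands in this thread do, e.g. after chmod +x distributed_train.sh:

CUDA_VISIBLE_DEVICES=1,2 ./distributed_train.sh 2 ./dataset/DRiD --model lvvit_s -b 64 --apex-amp --img-size 224 --drop-path 0.1 --token-label --token-label-size 14 --dense-weight 0.0 --num-classes 2 --finetune ./pretrained/lvvit_s-26M-384-84-4.pth.tar
)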
The training set only contains 2 classes; are these config settings right?
CUDA_VISIBLE_DEVICES=1,2 ./distributed_train.sh 2 ./dataset/DRiD --model lvvit_s -b 64 --apex-amp --img-size 224 --drop-path 0.1 --token-label --token-label-size 2 --dense-weight 0.0 --num-classes 2 --finetune ./pretrained/lvvit_s-26M-384-84-4.pth.tar
but the error is:
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [30,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [31,0,0] Assertion `t >= 0 && t < n_classes` failed.
Traceback (most recent call last):
File "main.py", line 834, in
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f35118808b2 in /home/lbc/.local/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xad2 (0x7f3511ad2952 in /home/lbc/.local/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f351186bb7d in /home/lbc/.local/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #3:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/usr/local/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/lbc/.local/lib/python3.6/site-packages/torch/distributed/launch.py", line 260, in
Hi, have you checked your dataset structure? It should be the same as the ImageNet folder structure mentioned in the readme.
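As a quick sanity check, here is a minimal sketch (assuming torchvision's ImageFolder and the ./dataset/DRiD path from your command; adjust to your layout) for confirming that the labels the loader produces stay inside [0, num_classes), since the CUDA assertion t >= 0 && t < n_classes fires exactly when a target label falls outside that range:

from torchvision.datasets import ImageFolder

# Hypothetical path taken from the command above; point this at the actual
# training folder.
dataset = ImageFolder('./dataset/DRiD/train')

# ImageFolder assigns one integer label per subfolder of the root. With
# --num-classes 2 this list must contain exactly two entries; any extra
# subfolder yields targets >= n_classes and triggers the device-side assert.
print('classes found:', dataset.classes)
print('number of classes:', len(dataset.classes))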
The dataset structure is:
│dataset
├──train/
│  ├── n01440764
│  │   ├── n01440764_10026.JPEG
│  │   ├── n01440764_10027.JPEG
│  │   ├── ......
│  ├── .txt
├──val/
│  ├── n01440764
│  │   ├── ILSVRC2012_val_00000293.JPEG
│  │   ├── ILSVRC2012_val_00002138.JPEG
│  │   ├── ......
│  ├── .txt
The default training data folder and validation data folder is set here https://github.com/zihangJiang/TokenLabeling/blob/66c79771631fd8b0d2be505c30f467147a59a81d/main.py#L65-L68
In your case, I assume both the train and val folders are contained in your ./dataset/DRiD/regular-fundus-training folder.
Your dataset structure should be:
│regular-fundus-training
├──train/
│  ├── Class_a
│  │   ├── Class_a_1.JPEG
│  │   ├── Class_a_2.JPEG
│  │   ├── ......
│  ├── Class_b
│  │   ├── Class_b_1.JPEG
│  │   ├── Class_b_2.JPEG
│  │   ├── ......
├──val/
│  ├── Class_a
│  │   ├── Class_a_1.JPEG
│  │   ├── Class_a_2.JPEG
│  │   ├── ......
│  ├── Class_b
│  │   ├── Class_b_1.JPEG
│  │   ├── Class_b_2.JPEG
│  │   ├── ......
Then you can run the following command for fine-tuning:
CUDA_VISIBLE_DEVICES=1,2 ./distributed_train.sh 2 ./dataset/DRiD/regular-fundus-training --model lvvit_s -b 64 --apex-amp --img-size 224 --drop-path 0.1 --token-label --token-label-size 14 --dense-weight 0.0 --num-classes 2 --finetune ./pretrained/lvvit_s-26M-384-84-4.pth.tar
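(Two hedged notes on these flags: --dense-weight 0.0 should zero out the dense token-labeling loss term, which is needed here since the pre-generated label maps only cover ImageNet; and --token-label-size 14 matches the 14x14 token grid of lvvit_s at 224x224 input with patch size 16, so it is a spatial resolution, not the class count, and should stay at 14 rather than being set to 2 as in the earlier command.)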
Thank you, it does work.
tar -xvf lvvit_s-26M-384-84-4.pth.tar
tar: This does not look like a tar archive
tar: Skipping to next header
tar: Exiting with failure status due to previous errors
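(The .pth.tar suffix appears to be a naming convention inherited from timm-style checkpoints; the file is produced by torch.save rather than tar, which is why torch.load(), as suggested at the top of the thread, is the right way to open it.)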