zihangJiang / TokenLabeling

PyTorch implementation of "All Tokens Matter: Token Labeling for Training Better Vision Transformers"
Apache License 2.0

error: download the pretrained model but couldn't be unzipped #8

Closed Williamlizl closed 3 years ago

Williamlizl commented 3 years ago

```
$ tar -xvf lvvit_s-26M-384-84-4.pth.tar
tar: This does not look like a tar archive
tar: Skipping to next header
tar: Exiting with failure status due to previous errors
```

zihangJiang commented 3 years ago

Please use torch.load().
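Despite the `.pth.tar` suffix, the released checkpoints are PyTorch pickles saved with `torch.save`, not tar archives, which is why `tar -xvf` fails. A minimal sketch of opening one with `torch.load` (the demo file, path, and key names here are illustrative, not the real checkpoint contents):

```python
import os
import tempfile

import torch


def inspect_checkpoint(path):
    # map_location="cpu" lets you peek inside without a GPU.
    state = torch.load(path, map_location="cpu")
    # Checkpoints may be a raw state_dict or a dict wrapping one.
    sd = state.get("state_dict", state) if isinstance(state, dict) else state
    return list(sd.keys())


# Demo with a dummy checkpoint; for the real file, point this at
# /path/to/lvvit_s-26M-384-84-4.pth.tar instead.
path = os.path.join(tempfile.mkdtemp(), "dummy.pth.tar")
torch.save({"state_dict": {"head.weight": torch.zeros(2, 8)}}, path)
print(inspect_checkpoint(path))  # → ['head.weight']
```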

Williamlizl commented 3 years ago

> Please use torch.load().

Just replace the path? /path/to/lvvit_s-26M-384-84-4.pth.tar ?

zihangJiang commented 3 years ago

Yes, please run the following command to evaluate the model you mentioned:

```shell
python3 validate.py /path/to/imagenet/val --model lvvit_s --checkpoint /path/to/lvvit_s-26M-384-84-4.pth.tar --no-test-pool --amp --img-size 384 -b 64
```

For more examples, you can refer to https://github.com/zihangJiang/TokenLabeling/blob/main/eval.sh .

Williamlizl commented 3 years ago

> Yes, please run the following command to evaluate the model you mentioned:
>
> `python3 validate.py /path/to/imagenet/val --model lvvit_s --checkpoint /path/to/lvvit_s-26M-384-84-4.pth.tar --no-test-pool --amp --img-size 384 -b 64`
>
> For more examples, you can refer to https://github.com/zihangJiang/TokenLabeling/blob/main/eval.sh .

Yes, but I want to fine-tune the model on my own dataset:

```shell
~/TokenLabeling/pretrained$ CUDA_VISIBLE_DEVICES=1,2 distributed_train.sh 2 ./dataset/DRiD --model lvvit_s -b 64 --apex-amp --img-size 224 --drop-path 0.1 --token-label --token-label-size 14 --dense-weight 0.0 --num-classes 2 --finetune ./pretrained/lvvit_s-26M-384-84-4.pth.tar
distributed_train.sh: command not found
```

Is it right?
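The `command not found` part is a shell issue rather than a repo issue: a script in the current directory is not looked up via `$PATH`, so it must be invoked with an explicit `./` prefix (and be executable). A stand-in demonstration with a hypothetical `demo_train.sh`:

```shell
# Create a throwaway script to stand in for distributed_train.sh.
printf '#!/bin/sh\necho ok\n' > demo_train.sh
chmod +x demo_train.sh

# Bare name fails: the current directory is not searched.
demo_train.sh || true

# Explicit path works and prints "ok".
./demo_train.sh
```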

zihangJiang commented 3 years ago

Yes! I will add more instructions to the readme in future updates.

Williamlizl commented 3 years ago

> Yes! I will add more instructions to readme in our future updates.

The training set only contains 2 classes; are the config settings right?

```shell
CUDA_VISIBLE_DEVICES=1,2 ./distributed_train.sh 2 ./dataset/DRiD --model lvvit_s -b 64 --apex-amp --img-size 224 --drop-path 0.1 --token-label --token-label-size 2 --dense-weight 0.0 --num-classes 2 --finetune ./pretrained/lvvit_s-26M-384-84-4.pth.tar
```

but the error is:

```
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [30,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [31,0,0] Assertion `t >= 0 && t < n_classes` failed.
Traceback (most recent call last):
  File "main.py", line 834, in <module>
    main()
  File "main.py", line 601, in main
    validate(model, loader_eval, validate_loss_fn, args, amp_autocast=amp_autocast)
  File "main.py", line 809, in validate
    torch.cuda.synchronize()
  File "/home/lbc/.local/lib/python3.6/site-packages/torch/cuda/__init__.py", line 380, in synchronize
    return torch._C._cuda_synchronize()
RuntimeError: CUDA error: device-side assert triggered
terminate called after throwing an instance of 'c10::Error'
  what(): CUDA error: device-side assert triggered
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f653ada78b2 in /home/lbc/.local/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xad2 (0x7f653aff9952 in /home/lbc/.local/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f653ad92b7d in /home/lbc/.local/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x5fec3a (0x7f658978ac3a in /home/lbc/.local/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x5fece6 (0x7f658978ace6 in /home/lbc/.local/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #5: /usr/local/bin/python3.6() [0x4c1ea2]
frame #6: /usr/local/bin/python3.6() [0x484689]
frame #7: /usr/local/bin/python3.6() [0x435fd7]
frame #8: /usr/local/bin/python3.6() [0x435fe7]
frame #9: /usr/local/bin/python3.6() [0x435fe7]
frame #10: PyDict_SetItemString + 0x3c7 (0x4a7b87 in /usr/local/bin/python3.6)
frame #11: PyImport_Cleanup + 0x7e (0x56222e in /usr/local/bin/python3.6)
frame #12: /usr/local/bin/python3.6() [0x4228b6]
frame #13: Py_Main + 0x63d (0x43c42d in /usr/local/bin/python3.6)
frame #14: main + 0x162 (0x41dec2 in /usr/local/bin/python3.6)
frame #15: __libc_start_main + 0xf0 (0x7f6593cb1840 in /lib/x86_64-linux-gnu/libc.so.6)
frame #16: _start + 0x29 (0x41df99 in /usr/local/bin/python3.6)

terminate called after throwing an instance of 'c10::Error'
  what(): CUDA error: device-side assert triggered
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f35118808b2 in /home/lbc/.local/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xad2 (0x7f3511ad2952 in /home/lbc/.local/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f351186bb7d in /home/lbc/.local/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x5fec3a (0x7f3560263c3a in /home/lbc/.local/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x5fece6 (0x7f3560263ce6 in /home/lbc/.local/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #5: /usr/local/bin/python3.6() [0x4c1ea2]
frame #6: /usr/local/bin/python3.6() [0x484689]
frame #7: /usr/local/bin/python3.6() [0x435fd7]
frame #8: /usr/local/bin/python3.6() [0x435fe7]
frame #9: /usr/local/bin/python3.6() [0x435fe7]
frame #10: PyDict_SetItemString + 0x3c7 (0x4a7b87 in /usr/local/bin/python3.6)
frame #11: PyImport_Cleanup + 0x7e (0x56222e in /usr/local/bin/python3.6)
frame #12: /usr/local/bin/python3.6() [0x4228b6]
frame #13: Py_Main + 0x63d (0x43c42d in /usr/local/bin/python3.6)
frame #14: main + 0x162 (0x41dec2 in /usr/local/bin/python3.6)
frame #15: __libc_start_main + 0xf0 (0x7f356a78a840 in /lib/x86_64-linux-gnu/libc.so.6)
frame #16: _start + 0x29 (0x41df99 in /usr/local/bin/python3.6)

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/lbc/.local/lib/python3.6/site-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/home/lbc/.local/lib/python3.6/site-packages/torch/distributed/launch.py", line 256, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/local/bin/python3.6', '-u', 'main.py', '--local_rank=1', './dataset/DRiD', '--model', 'lvvit_s', '-b', '64', '--apex-amp', '--img-size', '224', '--drop-path', '0.1', '--token-label', '--token-label-size', '2', '--dense-weight', '0.0', '--num-classes', '2', '--finetune', './pretrained/lvvit_s-26M-384-84-4.pth.tar']' died with <Signals.SIGABRT: 6>.
```
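The `t >= 0 && t < n_classes` assertion generally means some target label fell outside `[0, num_classes)`; with `--num-classes 2`, every label must be 0 or 1. On CPU the same mistake raises an eager Python error instead of a delayed CUDA assert, which makes it easy to reproduce (the tensors below are illustrative, not from this run):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 2)            # batch of 4, num_classes = 2
targets = torch.tensor([0, 1, 1, 2])  # label 2 is out of range for 2 classes

try:
    F.cross_entropy(logits, targets)
except Exception as e:
    # On CPU this fails immediately with an out-of-bounds target error,
    # the same condition that trips the device-side assert on GPU.
    print("bad label:", e)
```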

zihangJiang commented 3 years ago

Hi, have you checked your dataset structure? It should be the same as the ImageNet folder structure mentioned in the readme.

Williamlizl commented 3 years ago

> Hi, have you checked your dataset structure? It should be same as ImageNet folder structure mentioned in readme.

The dataset structure is:

```
│dataset
├──train/
│  ├── n01440764
│  │   ├── n01440764_10026.JPEG
│  │   ├── n01440764_10027.JPEG
│  │   ├── ......
│  ├── .txt
├──val/
│  ├── n01440764
│  │   ├── ILSVRC2012_val_00000293.JPEG
│  │   ├── ILSVRC2012_val_00002138.JPEG
│  │   ├── ......
│  ├── .txt
```


zihangJiang commented 3 years ago

The default training data folder and validation data folder is set here https://github.com/zihangJiang/TokenLabeling/blob/66c79771631fd8b0d2be505c30f467147a59a81d/main.py#L65-L68

In your case, I assume both train and val folders are contained in your ./dataset/DRiD/regular-fundus-training folder. Your dataset structure should be:

```
│regular-fundus-training
├──train/
│  ├── Class_a
│  │   ├── Class_a_1.JPEG
│  │   ├── Class_a_2.JPEG
│  │   ├── ......
│  ├── Class_b
│  │   ├── Class_b_1.JPEG
│  │   ├── Class_b_2.JPEG
│  │   ├── ......
├──val/
│  ├── Class_a
│  │   ├── Class_a_1.JPEG
│  │   ├── Class_a_2.JPEG
│  │   ├── ......
│  ├── Class_b
│  │   ├── Class_b_1.JPEG
│  │   ├── Class_b_2.JPEG
│  │   ├── ......
```
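A quick way to sanity-check that a root folder follows this class-subfolder layout before launching training (a sketch; the root path, split names, and `Class_a`/`Class_b` names are assumptions):

```python
import os
import tempfile


def list_classes(root, splits=("train", "val")):
    # Return the class sub-folders found under each split directory.
    found = {}
    for split in splits:
        split_dir = os.path.join(root, split)
        found[split] = sorted(
            d for d in os.listdir(split_dir)
            if os.path.isdir(os.path.join(split_dir, d))
        )
    return found


# Demo on a throwaway tree mirroring the structure above; point `root`
# at ./dataset/DRiD/regular-fundus-training for the real check.
root = tempfile.mkdtemp()
for split in ("train", "val"):
    for cls in ("Class_a", "Class_b"):
        os.makedirs(os.path.join(root, split, cls))
print(list_classes(root))
# → {'train': ['Class_a', 'Class_b'], 'val': ['Class_a', 'Class_b']}
```

Both splits should report the same class list, and every class folder should contain only image files.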

Then you can run the following command for fine-tuning:

```shell
CUDA_VISIBLE_DEVICES=1,2 ./distributed_train.sh 2 ./dataset/DRiD/regular-fundus-training --model lvvit_s -b 64 --apex-amp --img-size 224 --drop-path 0.1 --token-label --token-label-size 14 --dense-weight 0.0 --num-classes 2 --finetune ./pretrained/lvvit_s-26M-384-84-4.pth.tar
```

Williamlizl commented 3 years ago

> The default training data folder and validation data folder is set here https://github.com/zihangJiang/TokenLabeling/blob/66c79771631fd8b0d2be505c30f467147a59a81d/main.py#L65-L68
>
> In your case, I assume both train and val folders are contained in your ./dataset/DRiD/regular-fundus-training folder. Your dataset structure should be:
>
> ```
> │regular-fundus-training
> ├──train/
> │  ├── Class_a
> │  │   ├── Class_a_1.JPEG
> │  │   ├── Class_a_2.JPEG
> │  │   ├── ......
> │  ├── Class_b
> │  │   ├── Class_b_1.JPEG
> │  │   ├── Class_b_2.JPEG
> │  │   ├── ......
> ├──val/
> │  ├── Class_a
> │  │   ├── Class_a_1.JPEG
> │  │   ├── Class_a_2.JPEG
> │  │   ├── ......
> │  ├── Class_b
> │  │   ├── Class_b_1.JPEG
> │  │   ├── Class_b_2.JPEG
> │  │   ├── ......
> ```
>
> Then you can run the following command for fine-tuning:
>
> ```shell
> CUDA_VISIBLE_DEVICES=1,2 ./distributed_train.sh 2 ./dataset/DRiD/regular-fundus-training --model lvvit_s -b 64 --apex-amp --img-size 224 --drop-path 0.1 --token-label --token-label-size 14 --dense-weight 0.0 --num-classes 2 --finetune ./pretrained/lvvit_s-26M-384-84-4.pth.tar
> ```

Thank you, it does work.