Thanks for your reply. I tried to merge your code into timm, but I can't reproduce your accuracy. Did I miss something?
```bash
spring.submit arun --gpu -n16 \
    "python train.py /data/images --opt adamw --weight-decay 0.05 \
        --lr 1.6e-3 --warmup-lr 1e-6 --min-lr 1e-5 --decay-epochs 30 \
        --warmup-epochs 5 --reprob 0.25 --model lvvit_s -b 64 --apex-amp \
        --img-size 224 --drop-path 0.1 --token-label \
        --token-label-data /data/label_top5_train_nfnet --token-label-size 14 \
        --model-ema --model-ema-decay 0.9992 -j 8"
```
Can you check whether you have random augmentation (--aa) correctly enabled?
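For reference, a typical timm-style RandAugment flag looks like the line below; the policy string `rand-m9-mstd0.5-inc1` is a common timm recipe default and is an assumption here, not a value confirmed in this thread.

```bash
# Assumed example: enabling timm-style RandAugment; the policy string
# rand-m9-mstd0.5-inc1 is a common timm default, not confirmed here.
python train.py /data/images --model lvvit_s -b 64 --aa rand-m9-mstd0.5-inc1
```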
Thanks a lot. I will try again :) Also, I found that the line below differs from timm; it should probably use rank instead of local_rank. Otherwise, we may encounter conflicting model writes when training across multiple machines (e.g., with more than 8 GPUs).
This is because your two machines are on the same file system. In this case, you only need to save the model from one process (i.e., guard the save with the args.rank == 0 condition).
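As a minimal sketch of that guard (assuming a torch.distributed setup where args.rank holds the global rank, i.e. the value returned by dist.get_rank()):

```python
import torch
import torch.distributed as dist

def save_checkpoint(model, path):
    """Write the checkpoint from the global rank-0 process only."""
    # The local rank is 0 on every machine, so guarding on it would
    # trigger one write per node and conflict on a shared filesystem;
    # the global rank is unique across all processes.
    if not dist.is_initialized() or dist.get_rank() == 0:
        torch.save(model.state_dict(), path)
    if dist.is_initialized():
        dist.barrier()  # let the other ranks wait for the write to finish
```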
You can refer to the training log of LV-ViT-S here: https://github.com/zihangJiang/TokenLabeling/issues/17#issuecomment-917027674.