zchoi / S2-Transformer

[IJCAI 2022] Official PyTorch code for the paper "S2 Transformer for Image Captioning"
https://www.ijcai.org/proceedings/2022/0224.pdf
MIT License

Unexpected results when reproducing the experiments #2

Closed zhouMail closed 2 years ago

zhouMail commented 2 years ago

Hello, while reproducing the experiments, all metrics behave normally during XE training and increase steadily. During RL training, BLEU_4 is normal for the first few epochs, but then the test BLEU_4 drops sharply (from 40.2 to 39.1) and afterwards stays around 39.2. I am using the source code unchanged, and evaluation uses the X101_grid_feats_coco_trainval.hdf5 file. Is there some detail in my setup that I should have changed? @zchoi

zchoi commented 2 years ago

[screenshot: training log]

Hi, training looked fine on B@4 on my side; the screenshot above is the training log from that run, where the model reached its best score at epoch 34 of the RL stage. Could you share your log file? My training was run on 8 V100 GPUs.

zhouMail commented 2 years ago

I am running on a single A4000. Single-GPU training shouldn't matter, should it? The run is still in progress and hasn't finished yet.

zchoi commented 2 years ago

Sure, let's wait until the run finishes and take a look.

zhouMail commented 2 years ago

Hi, for multi-GPU training, is modifying the worldSize value all that's needed? After I changed it, the run kept loading for a long time and never started training.

zchoi commented 2 years ago

Yes, modifying worldSize is all that's needed; it runs fine on my side, and the output is shown below. You could check your machine's configuration and memory quota, and try a few more times.

```
zhanghaonan @ v100-6 in /mnt/hdd1/zhanghaonan/S2-Transformer on git:main x [5:24:14] C:1
$ bash train.sh
Namespace(annotation_folder='/home/zhanghaonan/IJCAI-release/m2_annotations', batch_size=50, exp_name='demo', features_path='/home/zhanghaonan/IJCAI-release/X101-features/X101_grid_feats_coco_trainval.hdf5', head=8, logs_folder='tensorboard_logs', m=40, num_clusters=5, refine_epoch_rl=28, resume_best=False, resume_last=False, rl_base_lr=5e-06, text2text=0, warmup=10000, workers=4, xe_base_lr=0.0001, xe_least=15, xe_most=20)

Distribute config Namespace(annotation_folder='/home/zhanghaonan/IJCAI-release/m2_annotations', batch_size=50, exp_name='demo', features_path='/home/zhanghaonan/IJCAI-release/X101-features/X101_grid_feats_coco_trainval.hdf5', head=8, logs_folder='tensorboard_logs', m=40, num_clusters=5, refine_epoch_rl=28, resume_best=False, resume_last=False, rl_base_lr=4e-05, text2text=0, warmup=10000, workers=4, xe_base_lr=0.0008, xe_least=15, xe_most=20)
Rank0: Transformer Training
Rank4: Loading from vocabulary
Rank3: Loading from vocabulary
Rank5: Loading from vocabulary
Rank0: Loading from vocabulary
Rank1: Loading from vocabulary
Rank7: Loading from vocabulary
Rank6: Loading from vocabulary
Rank2: Loading from vocabulary
s: 0  rl_s: 0
Training starts
/home/zhanghaonan/anaconda3/envs/ic/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:136: UserWarning: Detected call of lr_scheduler.step() before optimizer.step(). In PyTorch 1.1.0 and later, you should call them in the opposite order: optimizer.step() before lr_scheduler.step(). Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
s: 1
[the same "s: 0 rl_s: 0 / Training starts / UserWarning / s: 1" block is printed once per rank, eight times in total]
lr = 0.0002
Epoch 0 - train:   0%|          | 0/1417 [00:00<?, ?it/s]
Epoch 0 - train:  20%|████████████████████ | 278/1417 [01:54<04:43, 4.02it/s, loss=4.45]
```
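For readers hitting the same question: below is a minimal sketch of the standard single-machine torch.distributed launch that a world-size setting typically drives. It is illustrative only, assuming the usual NCCL + mp.spawn pattern; the function and variable names (run_worker, WORLD_SIZE) are not taken from this repository.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

WORLD_SIZE = 8  # number of GPUs on the single machine


def run_worker(rank: int, world_size: int) -> None:
    # Each spawned process binds to one GPU and joins the NCCL process group.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", world_size=world_size, rank=rank)
    torch.cuda.set_device(rank)
    # ... build the model, wrap it in DistributedDataParallel, run the training loop ...
    dist.destroy_process_group()


if __name__ == "__main__":
    # One process per GPU; with WORLD_SIZE = 1 this reduces to single-card training.
    mp.spawn(run_worker, args=(WORLD_SIZE,), nprocs=WORLD_SIZE)
```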

zhouMail commented 2 years ago

OK, thank you very much!

zhouMail commented 2 years ago

Hi, is this single-machine multi-GPU training? When I try multi-GPU training, every GPU involved sits at 100% utilization and training cannot proceed.

zchoi commented 2 years ago

Yes, it is single-machine multi-GPU training. At which epoch does GPU utilization hit 100%?

zhouMail commented 2 years ago

Training never even starts loading; it already hangs at the dist.init_process_group("nccl", world_size=worldSize, rank=rank) step. [screenshot]
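A generic debugging aid for this kind of hang (not part of the repository's code): enabling NCCL logging, pinning the rendezvous address and port, and adding a finite timeout usually turns a silent hang at init_process_group into a visible error. The helper name below is illustrative.

```python
import os
from datetime import timedelta

import torch.distributed as dist


def init_with_diagnostics(rank: int, world_size: int) -> None:
    os.environ["NCCL_DEBUG"] = "INFO"        # print NCCL setup/transport details
    os.environ["NCCL_BLOCKING_WAIT"] = "1"   # make the timeout below effective for NCCL
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")  # single-machine rendezvous address
    os.environ.setdefault("MASTER_PORT", "29500")      # any free port

    # A finite timeout turns an indefinite hang into an explicit error.
    dist.init_process_group(
        "nccl",
        world_size=world_size,
        rank=rank,
        timeout=timedelta(minutes=5),
    )
```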

zhouMail commented 2 years ago

[screenshot: CUDA version] Hi, could you share your CUDA version? Mine is the version shown above.

zchoi commented 2 years ago

Hi, my CUDA version is also 11.4. [screenshot]

zchoi commented 2 years ago

Hi, which PyTorch version are you using? Mine is 1.7.1+cu110, and it is best to keep them consistent. If it still doesn't work, you could try running on a different machine.
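A quick, generic way to check that the installed PyTorch build and its CUDA runtime match what is reported above (1.7.1+cu110 on a machine with CUDA 11.4 drivers); this snippet is plain PyTorch, not repository-specific.

```python
import torch
import torch.distributed as dist

print("torch version  :", torch.__version__)      # e.g. 1.7.1+cu110
print("built for CUDA :", torch.version.cuda)     # e.g. 11.0
print("CUDA available :", torch.cuda.is_available())
print("GPU count      :", torch.cuda.device_count())
print("NCCL available :", dist.is_nccl_available())
```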

zhouMail commented 2 years ago

OK, thanks.

zhouMail commented 2 years ago

[screenshot: test BLEU_4 log]

Hi, this is the test_bleu4 log from my reproduction; the gap feels quite large.

zchoi commented 2 years ago

Hi, I have uploaded the training log for the CIDEr=133.7 run; you can compare against it during training to check where things go wrong.