Closed · zhouMail closed this issue 2 years ago
Hi, my training at the time was fine in terms of B@4; here is a screenshot of the training log from that run, where the model reaches its best result at epoch 34 of the RL stage. Could you share your log file? My training was run on 8 V100 GPUs.
I'm running on a single A4000. Training on a single GPU shouldn't make a difference, should it? It's still running and hasn't finished yet.
Sure, let's take a look once it finishes.
Hi, for multi-GPU training, is changing the worldSize value all that's needed? After I changed it, it loaded for a long time without ever starting to train.
Yes, changing worldSize is all that's needed. It runs fine on my side; the output is below. You could check your machine's configuration and memory quota, and try a few more times.
zhanghaonan @ v100-6 in /mnt/hdd1/zhanghaonan/S2-Transformer on git:main x [5:24:14] C:1
$ bash train.sh
Namespace(annotation_folder='/home/zhanghaonan/IJCAI-release/m2_annotations', batch_size=50, exp_name='demo', features_path='/home/zhanghaonan/IJCAI-release/X101-features/X101_grid_feats_coco_trainval.hdf5', head=8, logs_folder='tensorboard_logs', m=40, num_clusters=5, refine_epoch_rl=28, resume_best=False, resume_last=False, rl_base_lr=5e-06, text2text=0, warmup=10000, workers=4, xe_base_lr=0.0001, xe_least=15, xe_most=20)
Distribute config Namespace(annotation_folder='/home/zhanghaonan/IJCAI-release/m2_annotations', batch_size=50, exp_name='demo', features_path='/home/zhanghaonan/IJCAI-release/X101-features/X101_grid_feats_coco_trainval.hdf5', head=8, logs_folder='tensorboard_logs', m=40, num_clusters=5, refine_epoch_rl=28, resume_best=False, resume_last=False, rl_base_lr=4e-05, text2text=0, warmup=10000, workers=4, xe_base_lr=0.0008, xe_least=15, xe_most=20)
Rank0: Transformer Training
Rank4: Loading from vocabulary
Rank3: Loading from vocabulary
Rank5: Loading from vocabulary
Rank0: Loading from vocabulary
Rank1: Loading from vocabulary
Rank7: Loading from vocabulary
Rank6: Loading from vocabulary
Rank2: Loading from vocabulary
s: 0
rl_s: 0
Training starts
/home/zhanghaonan/anaconda3/envs/ic/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:136: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
s: 1
s: 0
rl_s: 0
Training starts
/home/zhanghaonan/anaconda3/envs/ic/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:136: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
s: 1
s: 0
rl_s: 0
Training starts
/home/zhanghaonan/anaconda3/envs/ic/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:136: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
s: 1
lr = 0.0002
Epoch 0 - train: 0%| | 0/1417 [00:00<?, ?it/s]s: 0
rl_s: 0
Training starts
/home/zhanghaonan/anaconda3/envs/ic/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:136: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
s: 1
s: 0
rl_s: 0
Training starts
/home/zhanghaonan/anaconda3/envs/ic/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:136: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
s: 1
s: 0
rl_s: 0
Training starts
/home/zhanghaonan/anaconda3/envs/ic/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:136: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
s: 1
s: 0
rl_s: 0
Training starts
/home/zhanghaonan/anaconda3/envs/ic/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:136: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
s: 1
s: 0
rl_s: 0
Training starts
/home/zhanghaonan/anaconda3/envs/ic/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:136: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
s: 1
Epoch 0 - train: 20%|████████████████████ | 278/1417 [01:54<04:43, 4.02it/s, loss=4.45]
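As a side note, the UserWarning repeated once per rank in the log above only concerns call order: stepping the scheduler before the optimizer makes PyTorch skip the first value of the learning-rate schedule. A minimal illustration of the order PyTorch 1.1.0+ expects (the model, optimizer, and scheduler below are placeholders, not the ones used in this repo):

import torch

# Placeholder model/optimizer/scheduler, purely to illustrate the call order.
model = torch.nn.Linear(10, 10)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lambda s: min(1.0, (s + 1) / 10000))

for _ in range(3):
    loss = model(torch.randn(4, 10)).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()   # step the optimizer first...
    scheduler.step()   # ...then the scheduler, as the warning recommends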
OK, thank you very much!
Hi, is this single-machine multi-GPU training? When I train on multiple GPUs, every GPU being used sits at 100% utilization and training cannot proceed.
Yes, it is single-machine multi-GPU training. At which epoch does GPU utilization reach 100%?
Training never actually starts loading; it already hangs at the dist.init_process_group("nccl", world_size=worldSize, rank=rank) step.
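For reference, a minimal sketch of the process-group setup in question (the env:// rendezvous values, the worker function name, and the NCCL_DEBUG setting are illustrative assumptions, not the repo's actual code); init_process_group blocks exactly like this when the ranks cannot rendezvous, and NCCL_DEBUG=INFO usually reveals why:

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, worldSize):
    # Standard env:// rendezvous: every rank must see the same
    # MASTER_ADDR/MASTER_PORT, otherwise init_process_group blocks forever.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")  # illustrative value
    os.environ.setdefault("MASTER_PORT", "29500")      # illustrative value
    dist.init_process_group("nccl", world_size=worldSize, rank=rank)
    torch.cuda.set_device(rank)
    dist.barrier()  # succeeds only once all worldSize ranks have joined
    print(f"Rank{rank}: process group ready")
    dist.destroy_process_group()

if __name__ == "__main__":
    os.environ.setdefault("NCCL_DEBUG", "INFO")  # print NCCL's view of a stalled rendezvous
    worldSize = torch.cuda.device_count()
    mp.spawn(worker, args=(worldSize,), nprocs=worldSize)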
Hi, could you share your CUDA version? Mine is the one shown here.
Hi, my CUDA version is also 11.4.
Hi, what is your PyTorch version? Mine is 1.7.1+cu110, and it's best to keep them consistent. If it still doesn't work, you could try running on another machine.
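A quick way to print the versions worth comparing on both machines (standard PyTorch attributes):

import torch

print(torch.__version__)           # e.g. 1.7.1+cu110
print(torch.version.cuda)          # CUDA version PyTorch was built against
print(torch.cuda.is_available())   # whether the installed driver/runtime pair is usable
print(torch.cuda.nccl.version())   # NCCL version bundled with PyTorch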
OK, thanks.
Hi, this is the test_bleu4 log file from my reproduction; the gap seems rather large.
Hi, while reproducing the experiment, all metrics during XE training were normal and increased steadily. During RL training, BLEU_4 was normal for the first few epochs, but then the Test Bleu_4 dropped sharply (from 40.2 to 39.1) and stayed around 39.2 afterwards. I used the source code without any modification, and evaluation used the X101_grid_feats_coco_trainval.hdf5 file. Is there some detail of the experiment I failed to change? @zchoi