Closed · zhouMail closed this issue 2 years ago
Hi, my training at the time was fine in terms of B@4; here is a screenshot of the training log from that run, where the model reaches its best result at epoch 34 of the RL stage. Could you share your log file? My training was run on 8 V100 GPUs.
I'm running on a single A4000. Training on a single GPU shouldn't make a difference, should it? It's still running and hasn't finished yet.
Sure, let's take a look once it finishes.
Hi, for multi-GPU training, is changing the worldSize value all that's needed? After I changed it, it loaded for a long time without ever starting to train.
Yes, changing worldSize is all that's needed. It runs fine on my side; the output is below. You could check your machine's configuration and memory quota, and try a few more times.
zhanghaonan @ v100-6 in /mnt/hdd1/zhanghaonan/S2-Transformer on git:main x [5:24:14] C:1
$ bash train.sh
Namespace(annotation_folder='/home/zhanghaonan/IJCAI-release/m2_annotations', batch_size=50, exp_name='demo', features_path='/home/zhanghaonan/IJCAI-release/X101-features/X101_grid_feats_coco_trainval.hdf5', head=8, logs_folder='tensorboard_logs', m=40, num_clusters=5, refine_epoch_rl=28, resume_best=False, resume_last=False, rl_base_lr=5e-06, text2text=0, warmup=10000, workers=4, xe_base_lr=0.0001, xe_least=15, xe_most=20)
Distribute config Namespace(annotation_folder='/home/zhanghaonan/IJCAI-release/m2_annotations', batch_size=50, exp_name='demo', features_path='/home/zhanghaonan/IJCAI-release/X101-features/X101_grid_feats_coco_trainval.hdf5', head=8, logs_folder='tensorboard_logs', m=40, num_clusters=5, refine_epoch_rl=28, resume_best=False, resume_last=False, rl_base_lr=4e-05, text2text=0, warmup=10000, workers=4, xe_base_lr=0.0008, xe_least=15, xe_most=20)
Rank0: Transformer Training
Rank4: Loading from vocabulary
Rank3: Loading from vocabulary
Rank5: Loading from vocabulary
Rank0: Loading from vocabulary
Rank1: Loading from vocabulary
Rank7: Loading from vocabulary
Rank6: Loading from vocabulary
Rank2: Loading from vocabulary
s: 0
rl_s: 0
Training starts
/home/zhanghaonan/anaconda3/envs/ic/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:136: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
s: 1
s: 0
rl_s: 0
Training starts
/home/zhanghaonan/anaconda3/envs/ic/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:136: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
s: 1
s: 0
rl_s: 0
Training starts
/home/zhanghaonan/anaconda3/envs/ic/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:136: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
s: 1
lr = 0.0002
Epoch 0 - train: 0%| | 0/1417 [00:00<?, ?it/s]s: 0
rl_s: 0
Training starts
/home/zhanghaonan/anaconda3/envs/ic/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:136: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
s: 1
s: 0
rl_s: 0
Training starts
/home/zhanghaonan/anaconda3/envs/ic/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:136: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
s: 1
s: 0
rl_s: 0
Training starts
/home/zhanghaonan/anaconda3/envs/ic/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:136: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
s: 1
s: 0
rl_s: 0
Training starts
/home/zhanghaonan/anaconda3/envs/ic/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:136: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
s: 1
s: 0
rl_s: 0
Training starts
/home/zhanghaonan/anaconda3/envs/ic/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:136: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
s: 1
Epoch 0 - train: 20%|████████████████████ | 278/1417 [01:54<04:43, 4.02it/s, loss=4.45]
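As a side note, the UserWarning repeated once per rank in the log above only concerns call order: stepping the scheduler before the optimizer makes PyTorch skip the first value of the learning-rate schedule. A minimal illustration of the order PyTorch 1.1.0+ expects (the model, optimizer, and scheduler below are placeholders, not the ones used in this repo):

import torch

# Placeholder model/optimizer/scheduler, purely to illustrate the call order.
model = torch.nn.Linear(10, 10)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lambda s: min(1.0, (s + 1) / 10000))

for _ in range(3):
    loss = model(torch.randn(4, 10)).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()   # step the optimizer first...
    scheduler.step()   # ...then the scheduler, as the warning recommends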
OK, thank you very much!
Hi, is this single-machine multi-GPU training? When I train on multiple GPUs, every GPU being used sits at 100% utilization and training cannot proceed.
Yes, it is single-machine multi-GPU training. At which epoch does GPU utilization reach 100%?
Training never actually starts loading; it already hangs at the dist.init_process_group("nccl", world_size=worldSize, rank=rank) step.
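For reference, a minimal sketch of the process-group setup in question (the env:// rendezvous values, the worker function name, and the NCCL_DEBUG setting are illustrative assumptions, not the repo's actual code); init_process_group blocks exactly like this when the ranks cannot rendezvous, and NCCL_DEBUG=INFO usually reveals why:

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, worldSize):
    # Standard env:// rendezvous: every rank must see the same
    # MASTER_ADDR/MASTER_PORT, otherwise init_process_group blocks forever.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")  # illustrative value
    os.environ.setdefault("MASTER_PORT", "29500")      # illustrative value
    dist.init_process_group("nccl", world_size=worldSize, rank=rank)
    torch.cuda.set_device(rank)
    dist.barrier()  # succeeds only once all worldSize ranks have joined
    print(f"Rank{rank}: process group ready")
    dist.destroy_process_group()

if __name__ == "__main__":
    os.environ.setdefault("NCCL_DEBUG", "INFO")  # print NCCL's view of a stalled rendezvous
    worldSize = torch.cuda.device_count()
    mp.spawn(worker, args=(worldSize,), nprocs=worldSize)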
Hi, could you share your CUDA version? Mine is the one shown here.
Hi, my CUDA version is also 11.4.
Hi, what is your PyTorch version? Mine is 1.7.1+cu110, and it's best to keep them consistent. If it still doesn't work, you could try running on another machine.
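A quick way to print the versions worth comparing on both machines (standard PyTorch attributes):

import torch

print(torch.__version__)           # e.g. 1.7.1+cu110
print(torch.version.cuda)          # CUDA version PyTorch was built against
print(torch.cuda.is_available())   # whether the installed driver/runtime pair is usable
print(torch.cuda.nccl.version())   # NCCL version bundled with PyTorch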
OK, thanks.
Hi, this is the test_bleu4 log file from my reproduction; the gap seems rather large.
Hi, while reproducing the experiment, all metrics during XE training were normal and increased steadily. During RL training, BLEU_4 was normal for the first few epochs, but then the Test Bleu_4 dropped sharply (from 40.2 to 39.1) and stayed around 39.2 afterwards. I used the source code without any modification, and evaluation used the X101_grid_feats_coco_trainval.hdf5 file. Is there some detail of the experiment I failed to change? @zchoi