How to debug train.py with VS Code

wenet-e2e / wespeaker

Research and Production Oriented Speaker Verification, Recognition and Diarization Toolkit

Apache License 2.0

How to debug train.py with VS Code #122

Closed zuowanbushiwo closed 1 year ago

zuowanbushiwo commented 1 year ago

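Hello, I am very interested in this project and would like to step through the training process in VS Code. Following run.sh, I modified the main function of train.py as follows: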

    if __name__ == '__main__':
        fire.Fire(train(
            config='/home/yangjie/wespeaker/examples/cnceleb/v2/conf/resnet.yaml',
            exp_dir='/home/yangjie/wespeaker/examples/cnceleb/v2/exp/test',
            gpus='[0]',
            data_type='share',
            train_data='/home/yangjie/wespeaker_train_data/cnceleb_train/shard_test.list',
            train_label='/home/yangjie/wespeaker_train_data/cnceleb_train/utt2spk',
            reverb_data='/home/yangjie/wespeaker_train_data/rirs/lmdb',
            noise_data='/home/yangjie/wespeaker_train_data/musan/lmdb'))

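However, debugging still fails at https://github.com/wenet-e2e/wespeaker/blob/master/wespeaker/bin/train.py#L47. I suspect some environment variables need to be set, but I don't know how to set them, because until now I have only trained with PyTorch Lightning. Could you advise how to modify the main function of train.py so that I can single-step through it? Thanks!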

czy97 commented 1 year ago

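That line reads the environment variables required for PyTorch DDP training. Our script is meant to be launched from bash with torchrun, which sets these variables in the environment automatically. If you really want to debug, there are two options:

1. Remove the PyTorch DDP related code and debug in VS Code.
2. Run from the VS Code terminal and debug with print or pdb.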
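As a rough illustration of the first option, the DDP setup could be made conditional instead of deleted, so the same script still works under torchrun. This is only a sketch: `launched_by_torchrun` and `maybe_wrap_ddp` are hypothetical helper names, not part of wespeaker, and any other distributed calls in train.py (barriers, distributed samplers and so on) would presumably need a similar guard.

```python
import os


def launched_by_torchrun(env=None):
    """Return True when the variables that torchrun exports are present."""
    env = os.environ if env is None else env
    return all(key in env for key in ("RANK", "WORLD_SIZE", "LOCAL_RANK"))


def maybe_wrap_ddp(model, env=None):
    """Wrap the model in DDP only for a torchrun launch (hypothetical helper)."""
    env = os.environ if env is None else env
    if not launched_by_torchrun(env):
        # Plain single-process run, e.g. under the VS Code debugger.
        return model

    # Imported lazily so the debugger path never touches torch.distributed.
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    local_rank = int(env["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    if not dist.is_initialized():
        dist.init_process_group(backend="nccl")
    return DDP(model.cuda(), device_ids=[local_rank])
```

With this shape, starting train.py directly from the debugger leaves the model unwrapped, while a torchrun launch behaves as before.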

zuowanbushiwo commented 1 year ago

Yes, I tried setting the LOCAL_RANK and WORLD_SIZE environment variables, but I still get this error:

  File "/home/yangjie/wespeaker/examples/cnceleb/v2/wespeaker/bin/train.py", line 51, in train
    dist.init_process_group(backend='nccl')
  File "/home/yangjie/miniconda3/envs/wespeaker/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 595, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/yangjie/miniconda3/envs/wespeaker/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 247, in _env_rendezvous_handler
    rank = int(_get_env_or_raise("RANK"))
  File "/home/yangjie/miniconda3/envs/wespeaker/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 232, in _get_env_or_raise
    raise _env_error(env_var)
ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set
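The traceback shows the env:// rendezvous asking for RANK, which was not among the variables set. Besides RANK and WORLD_SIZE, this rendezvous method also reads MASTER_ADDR and MASTER_PORT. A minimal sketch, assuming a single process on one machine: the helper name `set_single_process_ddp_env` is made up for illustration, and the address and port are just common local defaults.

```python
import os

# Values torchrun would normally export for a one-process, one-node run.
DEBUG_DDP_ENV = {
    "RANK": "0",
    "LOCAL_RANK": "0",
    "WORLD_SIZE": "1",
    "MASTER_ADDR": "127.0.0.1",
    "MASTER_PORT": "29500",  # assumed to be a free port on this machine
}


def set_single_process_ddp_env(env=None):
    """Fill in missing DDP variables; values that are already set are kept."""
    env = os.environ if env is None else env
    for key, value in DEBUG_DDP_ENV.items():
        env.setdefault(key, value)
    return env


if __name__ == "__main__":
    # Call this before dist.init_process_group(...) is reached.
    set_single_process_ddp_env()
    print({key: os.environ[key] for key in DEBUG_DDP_ENV})
```

The same key/value pairs could instead go into the `env` field of a VS Code launch configuration. Note that the `nccl` backend in the traceback still needs a CUDA GPU even with a world size of 1.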

zuowanbushiwo commented 1 year ago

@czy97 Is the DDP-related code only in train.py?

zuowanbushiwo commented 1 year ago

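After modifying train.py following the first option, I get this error at runtime: AttributeError: 'ResNet' object has no attribute 'module'. It is raised by this line: https://github.com/wenet-e2e/wespeaker/blob/master/wespeaker/utils/executor.py#L63. Is this module attribute also added by DDP?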

czy97 commented 1 year ago

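> After modifying train.py following the first option, I get this error at runtime: AttributeError: 'ResNet' object has no attribute 'module'. It is raised by this line: https://github.com/wenet-e2e/wespeaker/blob/master/wespeaker/utils/executor.py#L63. Is this module attribute also added by DDP?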

Yes. DDP puts a wrapper around the ordinary model, which adds an intermediate layer named module. And yes, you only need to change the DDP setup in train.py.
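Code that reaches through `.module` can be made to tolerate both the wrapped and the unwrapped model with a small helper. This is a sketch only: `unwrap_model` is a hypothetical name, not a wespeaker function, and it assumes the bare model has no attribute of its own called `module`.

```python
def unwrap_model(model):
    """Return the underlying model whether or not a DDP-style wrapper is present.

    A DDP wrapper exposes the original model as `.module`; a bare model
    (assumed here to have no attribute named `module`) is returned unchanged.
    """
    return getattr(model, "module", model)


if __name__ == "__main__":
    class BareModel:
        """Stand-in for an unwrapped network such as ResNet."""

    class Wrapper:
        """Stand-in for DistributedDataParallel, which stores the model as .module."""

        def __init__(self, module):
            self.module = module

    bare = BareModel()
    print(unwrap_model(bare) is bare)           # bare model comes back as-is
    print(unwrap_model(Wrapper(bare)) is bare)  # wrapper is peeled off
```

An access such as `model.module.some_attr` could then be written as `unwrap_model(model).some_attr`, which works in both the debugger run and the torchrun run.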

zuowanbushiwo commented 1 year ago

Thanks, I can debug now.