wenet-e2e / wespeaker

Research and Production Oriented Speaker Verification, Recognition and Diarization Toolkit
Apache License 2.0

How to debug train.py with VS Code #122

Closed zuowanbushiwo closed 1 year ago

zuowanbushiwo commented 1 year ago

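Hello, I'm very interested in this project and would like to step through the training process in VS Code. Following run.sh, I modified the main block of train.py as follows: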

    if __name__ == '__main__':
        # Call train() directly with hard-coded arguments (mirroring run.sh)
        # so the debugger can step into it without command-line parsing.
        train(config='/home/yangjie/wespeaker/examples/cnceleb/v2/conf/resnet.yaml',
              exp_dir='/home/yangjie/wespeaker/examples/cnceleb/v2/exp/test',
              gpus='[0]', data_type='shard',
              train_data='/home/yangjie/wespeaker_train_data/cnceleb_train/shard_test.list',
              train_label='/home/yangjie/wespeaker_train_data/cnceleb_train/utt2spk',
              reverb_data='/home/yangjie/wespeaker_train_data/rirs/lmdb',
              noise_data='/home/yangjie/wespeaker_train_data/musan/lmdb')

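However, debugging still fails at https://github.com/wenet-e2e/wespeaker/blob/master/wespeaker/bin/train.py#L47. I suspect some environment variables need to be set, but I don't know which ones, since I have always trained with PyTorch Lightning before. Could you tell me how to modify the main block of train.py so that I can single-step through it? Thanks!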

czy97 commented 1 year ago

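That line reads the environment variables required for PyTorch DDP training. Our script is normally launched with torchrun from bash, which sets these variables automatically. If you really want to debug it, there are two ways: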

zuowanbushiwo commented 1 year ago

Yes, I tried setting the LOCAL_RANK and WORLD_SIZE environment variables, but I still get this error:

  File "/home/yangjie/wespeaker/examples/cnceleb/v2/wespeaker/bin/train.py", line 51, in train
    dist.init_process_group(backend='nccl')
  File "/home/yangjie/miniconda3/envs/wespeaker/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 595, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/yangjie/miniconda3/envs/wespeaker/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 247, in _env_rendezvous_handler
    rank = int(_get_env_or_raise("RANK"))
  File "/home/yangjie/miniconda3/envs/wespeaker/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 232, in _get_env_or_raise
    raise _env_error(env_var)
ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set
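
The traceback above shows that the env:// rendezvous also expects RANK, not only LOCAL_RANK and WORLD_SIZE. As a minimal sketch (not the maintainers' recommended fix), the variables that torchrun would normally provide can be set for a single-process debug session before train() is called. The helper name set_single_process_dist_env and the default address and port below are assumptions chosen for illustration; any free local port should work.

```python
import os


def set_single_process_dist_env(rank=0, world_size=1,
                                master_addr="127.0.0.1", master_port=29500):
    """Hypothetical helper: emulate the variables torchrun exports,
    for a single process, so env:// rendezvous can find them."""
    defaults = {
        "RANK": str(rank),          # global rank, the variable the traceback reports missing
        "LOCAL_RANK": str(rank),    # rank on this machine
        "WORLD_SIZE": str(world_size),
        "MASTER_ADDR": master_addr,       # assumed: rendezvous on localhost
        "MASTER_PORT": str(master_port),  # assumed: an unused local port
    }
    for name, value in defaults.items():
        # setdefault keeps any value that is already exported in the shell
        os.environ.setdefault(name, value)
    return {name: os.environ[name] for name in defaults}


env = set_single_process_dist_env()
print(env)
```

The call would go at the top of the main block, before train() runs. The same variables could instead be listed under the "env" key of the VS Code launch configuration, which avoids touching train.py.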

zuowanbushiwo commented 1 year ago

@czy97 Is train.py the only file that contains DDP-related code?

zuowanbushiwo commented 1 year ago

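After modifying train.py following the first method, I get this error when running: AttributeError: 'ResNet' object has no attribute 'module'. It is caused by this line: https://github.com/wenet-e2e/wespeaker/blob/master/wespeaker/utils/executor.py#L63 Is this module attribute also added by DDP?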

czy97 commented 1 year ago

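After modifying train.py following the first method, I get this error when running: AttributeError: 'ResNet' object has no attribute 'module'. It is caused by this line: https://github.com/wenet-e2e/wespeaker/blob/master/wespeaker/utils/executor.py#L63 Is this module attribute also added by DDP?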

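Yes. DDP adds a wrapper around the plain model, which introduces an extra intermediate level named module. And yes, you only need to change the DDP settings in train.py.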
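
To illustrate the wrapper behaviour described above: code that reaches through model.module fails on an unwrapped model, so one possible workaround (a sketch, not what wespeaker itself does) is a small helper that returns the inner model in either case. The names unwrap_model, PlainModel and Wrapper are hypothetical stand-ins so the example runs without PyTorch; Wrapper only mimics the single property of DistributedDataParallel that matters here, namely that it stores the original model in .module.

```python
def unwrap_model(model):
    """Return the underlying model whether or not it is wrapped.

    Assumes a wrapper exposes the original model as `.module`, as
    DistributedDataParallel does. A plain model that happens to define its
    own `module` attribute would be misdetected by this duck-typed check.
    """
    return getattr(model, "module", model)


class PlainModel:
    """Stand-in for an unwrapped model such as ResNet."""

    def __init__(self):
        self.name = "plain"


class Wrapper:
    """Stand-in for a DDP-style wrapper that keeps the model in `.module`."""

    def __init__(self, module):
        self.module = module


plain = PlainModel()
wrapped = Wrapper(plain)

# Both calls give back the same underlying object.
print(unwrap_model(wrapped) is plain)
print(unwrap_model(plain) is plain)
```

With such a helper, a line that currently reads model.module.<attr> could be written against unwrap_model(model) and would then work both under torchrun and in a single-process debug session.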

zuowanbushiwo commented 1 year ago

Thanks, I can debug now.