wenet-e2e / wespeaker

Research and Production Oriented Speaker Verification, Recognition and Diarization Toolkit
Apache License 2.0

Questions about distributed training using large-scale DB #240

Closed: greatnoble closed this issue 8 months ago

greatnoble commented 8 months ago

@JiJiJiang

Thank you so much for providing us with such amazing development tools.

I am planning to use your toolkit to train a model with an additional large-scale database on top of VoxCeleb2.

I would like to do distributed training using your tool on a total of 3 servers.

Please guide me on how to modify the run.sh file from the recipe you provided below.

Thanks.

JiJiJiang commented 8 months ago

Stage 1 in run.sh prepares the VoxCeleb dataset. You should prepare your own training data in the same format as this recipe does. After that, all you need to do is run run.sh stage by stage.
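For reference, the recipe's prepared data is essentially a per-utterance audio list plus speaker labels (wav.scp and utt2spk). A minimal sketch of the layout your own corpus would need before running the later stages; the directory and file paths below are only illustrative:

    # data/my_corpus/wav.scp : <utterance_id> <wav_path>
    spk0001-utt0001 /corpus/spk0001/utt0001.wav
    spk0001-utt0002 /corpus/spk0001/utt0002.wav

    # data/my_corpus/utt2spk : <utterance_id> <speaker_id>
    spk0001-utt0001 spk0001
    spk0001-utt0002 spk0001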

greatnoble commented 8 months ago

@JiJiJiang

Thank you very much for your quick reply.

I have already completed both stage 1 and stage 2 for data preparation.

 torchrun --standalone --nnodes=1 --nproc_per_node=$num_gpus \
    wespeaker/bin/train.py --config $config ...

My question is which part of the command above should be modified for multi-node training across three servers (multi-instance).

I would like to perform distributed training on a large corpus using three servers.

Thanks.

JiJiJiang commented 8 months ago

See this PR for details.
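For reference, a minimal multi-node sketch of the torchrun launch (not necessarily the exact change in that PR): each of the three servers runs the same command with its own --node_rank, and all nodes rendezvous at one master host. The $master_addr, $node_rank, and port values are placeholders you set per machine.

    # run on every node; node_rank is 0 on the master, 1 and 2 on the other servers
    torchrun --nnodes=3 --nproc_per_node=$num_gpus \
        --node_rank=$node_rank \
        --rdzv_id=wespeaker_train --rdzv_backend=c10d \
        --rdzv_endpoint=$master_addr:29400 \
        wespeaker/bin/train.py --config $config ...

Compared with the single-node command, the change is replacing --standalone with --nnodes=3 plus the rendezvous options so the three servers can find each other.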

greatnoble commented 8 months ago

@JiJiJiang

Thank you so much