zhijian-liu / torchpack

A neural network training interface based on PyTorch, with a focus on flexibility
https://pypi.org/project/torchpack/
MIT License

Distributed train problem #24

Closed moshicaixi closed 3 years ago

moshicaixi commented 3 years ago

When I run the command `torchpack dist-run -np 3 python train.py configs/semantic_kitti/spvcnn/cr0p5.yaml`, I get the error below. Could you please tell me how I can resolve this problem? Thanks very much!

ssh: Could not resolve hostname localhost:3: Name or service not known

ORTE was unable to reliably start one or more daemons. This usually is caused by:

zhijian-liu commented 3 years ago

Could you share the output of `mpirun --version`? Thanks!

moshicaixi commented 3 years ago

I forgot to update my issue. At first, I didn't know that there was no MPI environment on my server, which caused this error. After I installed MPICH, I got another error. When I switched to Open MPI, everything worked normally. Maybe there is a slight incompatibility between MPICH and Open MPI? I don't know. Anyway, thanks for your reply!

mpich error: `[mpiexec@guest-server] match_arg (../../../../mpich-3.4.2/src/pm/hydra/utils/args/args.c:160): unrecognized argument allow-run-as-root

[mpiexec@guest-server] HYDU_parse_array (../../../../mpich-3.4.2/src/pm/hydra/utils/args/args.c:175): argument matching returned error

[mpiexec@guest-server] parse_args (../../../../mpich-3.4.2/src/pm/hydra/ui/mpich/utils.c:1603): error parsing input array

[mpiexec@guest-server] HYD_uii_mpx_get_parameters (../../../../mpich-3.4.2/src/pm/hydra/ui/mpich/utils.c:1655): unable to parse user arguments

[mpiexec@guest-server] main (../../../../mpich-3.4.2/src/pm/hydra/ui/mpich/mpiexec.c:128): error parsing parameters`

zhijian-liu commented 3 years ago

Sounds good. Thanks for your update.
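
For anyone who lands here with the same error: the MPICH log above rejects `allow-run-as-root`, which is an Open MPI `mpirun` option, so the launcher appears to expect Open MPI. Below is a minimal sketch of a pre-flight check that classifies whichever `mpirun` is on the PATH before starting a distributed run. The function names `classify_mpi` and `detect_mpi` are hypothetical and not part of torchpack, and the substrings matched are assumptions about what each implementation typically prints for `mpirun --version`.

```python
import shutil
import subprocess


def classify_mpi(version_text: str) -> str:
    """Guess the MPI implementation from `mpirun --version` output.

    Assumption: Open MPI mentions "Open MPI" in its banner, while MPICH's
    Hydra launcher prints a "HYDRA build details" header.
    """
    text = version_text.lower()
    if "open mpi" in text or "open-mpi" in text:
        return "openmpi"
    if "hydra" in text or "mpich" in text:
        return "mpich"
    return "unknown"


def detect_mpi(launcher: str = "mpirun") -> str:
    """Run `<launcher> --version` and classify it.

    Returns "missing" when the launcher is not on the PATH.
    """
    if shutil.which(launcher) is None:
        return "missing"
    result = subprocess.run(
        [launcher, "--version"],
        capture_output=True,
        text=True,
        check=False,  # some launchers exit non-zero; we only need the text
    )
    return classify_mpi(result.stdout + result.stderr)


if __name__ == "__main__":
    flavor = detect_mpi()
    print(f"Detected MPI launcher: {flavor}")
    if flavor != "openmpi":
        print("In this thread, switching to Open MPI resolved the error.")
```

If this reports `mpich` or `missing`, installing Open MPI and making sure its `mpirun` comes first on the PATH is the fix that worked in this thread.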