xbpeng / DeepMimic

Motion imitation with deep reinforcement learning.
https://xbpeng.github.io/projects/DeepMimic/index.html
MIT License
2.31k stars 488 forks

MPI "Unable to start a daemon on the local node" #85

Closed vkozin97 closed 5 years ago

vkozin97 commented 5 years ago

Hi! I am trying to run the DeepMimic.py spinkick demo inside a Docker container based on Ubuntu 18.04. I have already built all the libraries and DeepMimicCore.py successfully. Now, when I run python3 DeepMimic.py --arg_file args/run_humanoid3d_spinkick_args.txt, the following MPI errors occur:

root@zcluster4:/localDeepMimic/DeepMimic-master# python3 DeepMimic.py --arg_file args/run_humanoid3d_spinkick_args.txt
--------------------------------------------------------------------------
The value of the MCA parameter "plm_rsh_agent" was set to a path
that could not be found:

  plm_rsh_agent: ssh : rsh

Please either unset the parameter, or check that the path is correct
--------------------------------------------------------------------------
[zcluster4:22139] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 582
[zcluster4:22139] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 166
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_init failed
  --> Returned value Unable to start a daemon on the local node (-127) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "Unable to start a daemon on the local node" (-127) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[zcluster4:22139] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

Both sudo apt install libopenmpi-dev and pip3 install mpi4py completed without errors. Here is my mpi4py info:

root@zcluster4:/localDeepMimic/DeepMimic-master# pip3 show mpi4py
Name: mpi4py
Version: 3.0.2
Summary: Python bindings for MPI
Home-page: https://bitbucket.org/mpi4py/mpi4py/
Author: Lisandro Dalcin
Author-email: dalcinl@gmail.com
License: BSD
Location: /usr/local/lib/python3.6/dist-packages
Requires: 

Any ideas of how to fix this?

vkozin97 commented 5 years ago

Oh, it seems I resolved this particular issue by running sudo apt install ssh. Closing.
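
For anyone hitting the same message: the "plm_rsh_agent: ssh : rsh" line in the log above suggests Open MPI looked for ssh and then rsh on PATH and found neither, which is common in minimal Docker images. Below is a rough sketch of a pre-flight check that could be run before starting DeepMimic.py. The helper name find_rsh_agent is hypothetical (it is not part of DeepMimic, mpi4py, or Open MPI), and it only checks PATH; it does not configure MPI in any way.

```python
import shutil


def find_rsh_agent(candidates=("ssh", "rsh"), path=None):
    """Return the full path of the first candidate found on PATH, else None.

    Hypothetical helper: it mirrors the "ssh : rsh" search order shown in the
    Open MPI error message, but it is not an Open MPI or mpi4py API.
    """
    for name in candidates:
        found = shutil.which(name, path=path)
        if found:
            return found
    return None


if __name__ == "__main__":
    agent = find_rsh_agent()
    if agent is None:
        # Assumption: installing an ssh client (as done in this thread)
        # is enough for Open MPI to start its local daemon.
        print("Neither ssh nor rsh is on PATH; try: apt install ssh")
    else:
        print(f"Launch agent found: {agent}")
```

If the check reports that nothing was found, installing ssh inside the container (the fix reported in this thread) should make the error go away.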

sd707589 commented 3 years ago

@vkozin97 Could you share your Docker project? Thanks.