yjxiong / caffe

A fork of Caffe with OpenMPI-based multi-GPU (mainly data-parallel) support for action recognition and more. For more documentation, please see the original README.
http://caffe.berkeleyvision.org/

"import caffe" raises an error when using MPI-enabled Caffe #197

Closed hiyijian closed 6 years ago

hiyijian commented 6 years ago

Dear @yjxiong, I compiled your Caffe fork with -DUSE_MPI=ON and everything works like a charm, except the Python interface.

I made a simple Python script, say test.py, containing a single line: import caffe. Running "mpirun -n 2 python test.py" raises the following error:

[003761c78f69:00470] mca_base_component_repository_open: unable to open mca_patcher_overwrite: /usr/local/mpi/lib/openmpi/mca_patcher_overwrite.so: undefined symbol: mca_patcher_base_patch_t_class (ignored)
[003761c78f69:00469] mca_base_component_repository_open: unable to open mca_patcher_overwrite: /usr/local/mpi/lib/openmpi/mca_patcher_overwrite.so: undefined symbol: mca_patcher_base_patch_t_class (ignored)
[003761c78f69:00470] mca_base_component_repository_open: unable to open mca_shmem_sysv: /usr/local/mpi/lib/openmpi/mca_shmem_sysv.so: undefined symbol: opal_show_help (ignored)
[003761c78f69:00469] mca_base_component_repository_open: unable to open mca_shmem_sysv: /usr/local/mpi/lib/openmpi/mca_shmem_sysv.so: undefined symbol: opal_show_help (ignored)
[003761c78f69:00469] mca_base_component_repository_open: unable to open mca_shmem_mmap: /usr/local/mpi/lib/openmpi/mca_shmem_mmap.so: undefined symbol: opal_show_help (ignored)
[003761c78f69:00470] mca_base_component_repository_open: unable to open mca_shmem_mmap: /usr/local/mpi/lib/openmpi/mca_shmem_mmap.so: undefined symbol: opal_show_help (ignored)
[003761c78f69:00469] mca_base_component_repository_open: unable to open mca_shmem_posix: /usr/local/mpi/lib/openmpi/mca_shmem_posix.so: undefined symbol: opal_shmem_base_framework (ignored)
[003761c78f69:00470] mca_base_component_repository_open: unable to open mca_shmem_posix: /usr/local/mpi/lib/openmpi/mca_shmem_posix.so: undefined symbol: opal_shmem_base_framework (ignored)
--------------------------------------------------------------------------
It looks like opal_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during opal_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  opal_shmem_base_select failed
  --> Returned value -1 instead of OPAL_SUCCESS
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------

Do you have any idea what is going wrong? Thank you :D

hiyijian commented 6 years ago

I found a graceful way to avoid this.

I will raise a PR later.

Thanks.
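
Note: the thread does not record what the workaround was, so the following is only a guess at one possible approach, not the fix that was actually submitted. "undefined symbol" errors from Open MPI plugins (the mca_*.so files in the log above) are commonly reported when an MPI-linked library is loaded from inside Python, because Python loads extension modules with local symbol visibility and the plugins then cannot resolve symbols from the Open MPI libraries that are already loaded. Assuming that is the cause here, a minimal sketch is to add RTLD_GLOBAL to Python's dlopen flags while the module is imported. The helper names global_dlopen_flags and import_with_global_symbols are hypothetical and not part of Caffe; the sketch is Unix-only, since sys.setdlopenflags is not available on Windows.

```python
import contextlib
import ctypes
import importlib
import sys


@contextlib.contextmanager
def global_dlopen_flags():
    """Temporarily add RTLD_GLOBAL to the flags Python passes to dlopen().

    Hypothetical helper for illustration; Unix-only.
    """
    old_flags = sys.getdlopenflags()
    # RTLD_GLOBAL makes symbols of libraries loaded in this window visible
    # to shared objects loaded later (for example Open MPI's mca_*.so plugins).
    sys.setdlopenflags(old_flags | ctypes.RTLD_GLOBAL)
    try:
        yield
    finally:
        # Restore the previous flags so unrelated imports are unaffected.
        sys.setdlopenflags(old_flags)


def import_with_global_symbols(module_name):
    """Import module_name while RTLD_GLOBAL is in effect (hypothetical helper)."""
    with global_dlopen_flags():
        return importlib.import_module(module_name)


# Assumed usage in test.py, in place of a plain "import caffe":
# caffe = import_with_global_symbols("caffe")
```

If the cause is something else (for example, a mismatch between the Open MPI version used at build time and the one found at run time), this sketch will not help, and checking which libmpi the Python extension actually resolves to would be the next step.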