yogevb / a-dda

Automatically exported from code.google.com/p/a-dda

Regression of sparse mode (current 1.3b version) #201

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
1. Download 1.3b4 or update to the current svn revision (1352)
2. Run 'make mpi' with SPARSE and USE_SSE3 enabled
3. Run with different numbers of MPI processes on a .geom shape file (I used the 
standard sphere.geom generated with ./adda -save_geom)

I expect to obtain the same result with any number of MPI processes, but it 
seems that only with -np 1 are the results correct.

Only SPARSE mode is affected by this regression.
Previous stable 1.2 release is NOT affected by this problem.

I provide my results in the attached results.zip file.
The four different results are obtained with the following commands, 
respectively:

fft_np_1: mpirun -np 1 ./adda_mpi -shape read sphere.geom
fft_np_2: mpirun -np 2 ./adda_mpi -shape read sphere.geom
sparse_np_1: mpirun -np 1 ./adda_spa_mpi -shape read sphere.geom
sparse_np_2: mpirun -np 2 ./adda_spa_mpi -shape read sphere.geom

The adda_spa_mpi executable is compiled with the SPARSE and USE_SSE3 options; 
adda_mpi is compiled without them.

ADDITIONAL INFO:
- gcc 4.8.2
- openmpi 1.6.5

Original issue reported on code.google.com by davide.o...@gmail.com on 13 Jul 2014 at 8:32

Attachments:

GoogleCodeExporter commented 9 years ago
That is strange indeed. I've tried it (r1352) on different systems:
1) Windows laptop, gcc 4.8.1, MPICH2 1.4.1p1 (here I used 'mpiexec -n 2' 
instead of 'mpirun -np 2').
2) Unix cluster, gcc 4.7.0, openmpi 1.4.5
3) Unix cluster, gcc 4.3.2, openmpi 1.4.3
and could not reproduce the issue (everything is working fine).

Therefore, I can only suggest a number of (crazy) tests/ideas at the moment.
- make sure that you do not have modified files in your source (run svn diff).
- run './adda_mpi -V' and show the output (down to GPL text).
- further let's simplify the test runs a little bit. Attached is sphere4.geom 
(obtained by 'adda -grid 4 -save_geom'). I propose the command line:
... -shape read sphere4.geom -sym enf
- play with compile flags. Remove USE_SSE3 (test) then additionally add 
DEBUGFULL (test).
- Then use the latter debug version to save as much as possible, like:
... -shape read sphere4.geom -sym enf -store_beam > out
and show here all the output (including stdout).
- If that is possible, please also try different (earlier) versions of gcc 
and/or openmpi.
- Just in case (should not make any difference) try using 'mpiexec -n 2' 
instead of 'mpirun -np 2'.

Original comment by yurkin on 14 Jul 2014 at 6:14

Attachments:

GoogleCodeExporter commented 9 years ago
Your suggestions were not crazy at all.
I found the problem.

After many tests (with different gcc versions and compiler options) I tried 
compiling an updated version of OpenMPI (1.8.1).
I was then able to get the right results using only the updated mpirun (without 
recompiling adda).
This was the most annoying part of the tests, because the official Ubuntu 14.04 
repository only has the 1.6 version of OpenMPI, so I had to compile the updated 
version from source.

I think that in the end this issue may be useful.
The current Ubuntu 14.04 LTS repository only has openmpi 1.6, which seems to be 
the reason for my strange results.

Even if sparse mode is not the most frequently used option for adda, Ubuntu 
14.04 users should be aware that they should prefer MPICH or a self-compiled 
OpenMPI library.

Of course, I am available for further investigations on this problem.

Original comment by davide.o...@gmail.com on 14 Jul 2014 at 5:16

GoogleCodeExporter commented 9 years ago
OK, we have a similar problem with gcc 4.6.2 (which also comes with an Ubuntu 
LTS, but 12.04) - issue 194, which was unsolvable on our side. Now we have 
another one with OpenMPI...

Davide, can you please localize the bug to a particular version (or range of 
versions) of OpenMPI? Ideally, it would be great if we could connect it to a 
certain bug in the OpenMPI issue tracker, but that may be hard to pinpoint.

Also, please run the debug version with storing incident beam, as I mentioned 
earlier, and report the output. I want to localize the problem in ADDA source 
as well. Maybe (though unlikely) it would be possible to arrange some MPI code 
to make it operational under faulty openmpi as well... 

Original comment by yurkin on 14 Jul 2014 at 6:47

GoogleCodeExporter commented 9 years ago
OK, I have run some tests.
You'll find all results in the attached ompi_check.zip.

In the zipped file you'll find openmpi_check.txt, which lists my OpenMPI tests 
with various library versions (all libraries and adda are compiled with gcc-4.8).
Basically, I found that the problem emerges with openmpi-1.5.4 and persists 
until openmpi-1.7.2.
The first OpenMPI version NOT affected by the bug is 1.7.3.

I've also run DEBUGFULL builds with 1 and 2 MPI processes, for OpenMPI versions 
1.6.5 (affected) and 1.8.1 (not affected).
All executables are compiled with gcc-4.8 and DEBUGFULL, and run with
... -shape read sphere4.geom -sym enf -store_beam > out...

You'll find all the results in the corresponding directories
run-OMPI_VERSION-np#/

Inside each directory you also find the corresponding output
out-OMPI_VERSION-np#

Original comment by davide.o...@gmail.com on 17 Jul 2014 at 10:14

Attachments:

GoogleCodeExporter commented 9 years ago
I don't know how the problem appeared in OpenMPI, but it seems to be fixed by 
this revision - https://svn.open-mpi.org/trac/ompi/changeset/29187 - and then 
it was incorporated into 1.7.3 - https://svn.open-mpi.org/trac/ompi/ticket/3772 
.

In ADDA the error probably appears during the call to 
MPI_Allgatherv(MPI_IN_PLACE,0,...), which in turn happens only during calls to 
AllGather(NULL,...). The latter is only called in the sparse MPI code in two 
places - matvec.c (for the matrix-vector product) and make_particle.c (to make 
position_full).

Interestingly, both of these calls should become irrelevant if issue 160 is 
implemented. But for now we need some workaround, or at least a meaningful 
warning/error message when a faulty OpenMPI is used.

Original comment by yurkin on 18 Jul 2014 at 9:14

GoogleCodeExporter commented 9 years ago
Maybe I am missing something, but those parts of the code are left unchanged 
from adda-1.2 to the current 1.3 revision.
If the problem appears during MPI_Allgatherv, it should also be present in the 
past release of adda. Am I correct?
On the contrary, adda-1.2 is not affected by this problem.

After some tests I have determined that the problem first appears with 
adda-r1253 (at least with ompi-1.6.5; I did not test other versions of ompi).

Original comment by davide.o...@gmail.com on 18 Jul 2014 at 12:12

GoogleCodeExporter commented 9 years ago
Davide, that is an important comment. I forgot about the dependence on the ADDA 
version. Then it seems that MPI_Allgatherv only has problems with complex 
built-in datatypes like MPI_C_DOUBLE_COMPLEX. Let's check it.

Please, change in parbas.h 
#ifdef MPI_C_DOUBLE_COMPLEX
to
#if 0

Then recompile and test with ompi-1.6.5.

Original comment by yurkin on 21 Jul 2014 at 4:18

GoogleCodeExporter commented 9 years ago
Checked!
Everything is fine with this change and ompi-1.6.5.
Maybe it is possible to avoid the problem by adding an OpenMPI version check in 
parbas.h?

Original comment by davide.o...@gmail.com on 21 Jul 2014 at 9:15

GoogleCodeExporter commented 9 years ago
It seems it was partly my fault after all. Davide, please test the recent 
r1355, it should fix the problem. It would be great if you can test the 
boundary cases (1.7.2 and 1.7.3), and also compiling ADDA using one version of 
OpenMPI then executing with another. ADDA should either work correctly or 
produce a meaningful error message.

Original comment by yurkin on 21 Jul 2014 at 10:58

GoogleCodeExporter commented 9 years ago
Everything works fine, except mpirun-1.7.2 running adda compiled with 1.7.3, 
which produces the following message, as expected:

ERROR: (../comm.c:296) MPI library version (2.1) is too old for current ADDA
executable. Version 2.2 or newer is required. Alternatively, you may recompile
ADDA using this version of the library.

Original comment by davide.o...@gmail.com on 21 Jul 2014 at 2:51

GoogleCodeExporter commented 9 years ago
Great, thanks for your efforts.

Original comment by yurkin on 21 Jul 2014 at 3:05