xiaoyeli / superlu_dist

Distributed memory, MPI based SuperLU
https://portal.nersc.gov/project/sparse/superlu/
Other
185 stars 65 forks source link

3D driver pzgssvx3d() with separated factorization and solve calls #77

Open jacobrking opened 3 years ago

jacobrking commented 3 years ago

Hi,

I don't think the factorization and solve phases can be separated in the 3d code in the same way that it can be done in the 2D code. In my application, we call pzgssvx3d() in two separate calls: first with Fact=DOFACT to only do factorization and then later with Fact=FACTORED to solve the system (potentially multiple times). On the second call we do not pass the matrix values. This causes the code to crash when calling zGatherNRformat_loc3d() inside pzgssvx3d(). The solution seems simple - skip the call to zGatherNRformat_loc3d() @ line 562 in SRC/pzgssvx3d.c when Fact=FACTORED but one complication is that A3d is needed to scatter the solution back to the 3D grid.

SRC/pzgssvx3d.c @ line 1565

    /* Scatter the solution from 2D grid_0 to 3D grid */
    zScatter_B3d(A3d, grid3d);

    B = A3d->B3d; // B is now assigned back to B3d on return
    A->Store = Astore3d; // restore Astore to 3D

    /* free A2d and B2d, which are allocated only in 2D layer Grid_0 */
    NRformat_loc *A2d = A3d->A_nfmt;
    if (grid3d->zscp.Iam == 0) {
       SUPERLU_FREE( A2d->rowptr );
       SUPERLU_FREE( A2d->colind );
       SUPERLU_FREE( A2d->nzval );
       SUPERLU_FREE( A3d->B2d );
    }
    SUPERLU_FREE( A2d );         // free 2D structure
    SUPERLU_FREE( A3d );         // free 3D structure

Here are the relevant lines from a stack trace that show the problem:

#6  0x0000000020c6ee53 in zGatherNRformat_loc3d (A=A@entry=0x100014c1520,
    B=B@entry=0x1003f50b120, ldb=ldb@entry=771, nrhs=nrhs@entry=1,
    grid3d=grid3d@entry=0x1000042cf30)
    at /global/u1/j/jakeking/nimall-trunk/superlu_dist7/SRC/znrformat_loc3d.c:100
#7  0x0000000020c6c2a8 in pzgssvx3d (options=0x10000d568d0, A=0x100014c1550,
    ScalePermstruct=0x10000f58920, B=B@entry=0x1003f50b120, ldb=771, nrhs=1,
    grid3d=0x1000042cf30, LUstruct=0x10001028010, SOLVEstruct=0x10001030ee0,
    berr=0x7fffffff40c0, stat=0x100010283f0, info=0x7fffffff4398)
    at /global/u1/j/jakeking/nimall-trunk/superlu_dist7/SRC/pzgssvx3d.c:562
#8  0x0000000020c062f3 in f_pzgssvx3d_ (options=options@entry=0x10000df4fe0,
    A=A@entry=0x7fffffff40b0,
    ScalePermstruct=ScalePermstruct@entry=0x10000df4fe8,
    B=B@entry=0x1003f50b120, ldb=ldb@entry=0x10000df4bc0,
    nrhs=nrhs@entry=0x2a83ab00, grid=0x4b6d9ed8 <__superlu_dist_mod_MOD_grid>,
    LUstruct=0x10000df4ff0, SOLVEstruct=0x10000df4ff8, berr=0x7fffffff40c0,
    stat=0x7fffffff40b8, info=0x7fffffff4398)
    at /global/u1/j/jakeking/nimall-trunk/superlu_dist7/FORTRAN/superlu_c2f_zwrap.c:155
#9  0x0000000020c051f0 in superlu_dist_z_mod::pzloc_superlu_dist (iopt=3,
    nrow=771, mloc=771, nnz=20169, fstrow=0, vals=...,
    ind=<error reading variable: value requires 80676 bytes, which is more than max-value-size>, ptr=..., rhssln=..., matrix=..., iparm=..., dparm=...,
    info=-47088)
    at /global/u1/j/jakeking/nimall-trunk/nimdevel/extwrappers/superlu_dist_z_mod.F90:166

Thanks.

xiaoyeli commented 3 years ago

In "maint" branch, I restructured driver pxgssvx3d to allow separate calls to Factorization and Solve. The A3d structure is persistent between multiple calls.

Please give a try.

jacobrking commented 3 years ago

Hi Sherry,

I've tested the maint branch and it is better. Thank you.

I've run into a new issue that I believe you can reproduce with a simple modification to the example pzdrive3d1. When I set nrhs=0 on the initial call to pzgssvx3d (so the first call only is for factorization only) the solution on the second call is no longer correct. Can you confirm this?

Index: EXAMPLE/pzdrive3d1.c
===================================================================
--- EXAMPLE/pzdrive3d1.c    (revision 2100)
+++ EXAMPLE/pzdrive3d1.c    (working copy)
@@ -340,7 +340,7 @@
     PStatInit (&stat);

     /* Call the linear equation solver. */
-    pzgssvx3d (&options, &A, &ScalePermstruct, b, ldb, nrhs, &grid,
+    pzgssvx3d (&options, &A, &ScalePermstruct, b, ldb, 0, &grid,
                &LUstruct, &SOLVEstruct, berr, &stat, &info);

Thanks, Jake

jacobrking commented 3 years ago

I should also mention that I needed to make this minor patch:

Index: FORTRAN/superlu_c2f_dwrap.c
===================================================================
--- FORTRAN/superlu_c2f_dwrap.c (revision 2100)
+++ FORTRAN/superlu_c2f_dwrap.c (working copy)
@@ -95,7 +95,7 @@
     dLUstructFree(LUstruct_ptr);
 }

-void f_dDestroy_A3d_gathered_on_2d(fptr *SOLVEstruct, fptr *grid)
+void f_dDestroy_A3d_gathered_on_2d(fptr *SOLVEstruct, fptr *grid3d)
 {
     dDestroy_A3d_gathered_on_2d((dSOLVEstruct_t *) *SOLVEstruct,
                                       (gridinfo3d_t *) *grid3d);
Index: FORTRAN/superlu_c2f_zwrap.c
===================================================================
--- FORTRAN/superlu_c2f_zwrap.c (revision 2100)
+++ FORTRAN/superlu_c2f_zwrap.c (working copy)
@@ -94,7 +94,7 @@
     zLUstructFree(LUstruct_ptr);
 }

-void f_zDestroy_A3d_gathered_on_2d(fptr *SOLVEstruct, fptr *grid)
+void f_zDestroy_A3d_gathered_on_2d(fptr *SOLVEstruct, fptr *grid3d)
 {
     zDestroy_A3d_gathered_on_2d((zSOLVEstruct_t *) *SOLVEstruct,
                                       (gridinfo3d_t *) *grid3d);
xiaoyeli commented 3 years ago

I fixed "nrhs==0" bug, and applied the patch in FORTRAN/ directory. Updated 'maint' branch.

jacobrking commented 3 years ago

Thanks! I'm now getting through the first factorization and solve with the 3d code.

However, it seems that there is another problem when doing a subsequent factorization with the same sparsity pattern. Here's the stack trace in the application:

#1  0x20c5a90e in pzdistribute
    at /global/u1/j/jakeking/nimall-trunk/superlu_dist7/SRC/pzdistribute.c:583
#2  0x20c7435f in pzgssvx3d
    at /global/u1/j/jakeking/nimall-trunk/superlu_dist7/SRC/pzgssvx3d.c:1192
#3  0x20c044a2 in f_pzgssvx3d_
    at /global/u1/j/jakeking/nimall-trunk/superlu_dist7/FORTRAN/superlu_c2f_zwrap.c:161
#4  0x20c03220 in __superlu_dist_z_mod_MOD_pzloc_superlu_dist
    at /global/u1/j/jakeking/nimall-trunk/nimdevel/extwrappers/superlu_dist_z_mod.F90:163

The issue can also be reproduced in pzdrive3d1 with the following modification:

Index: pzdrive3d1.c
===================================================================
--- pzdrive3d1.c    (revision 2105)
+++ pzdrive3d1.c    (working copy)
@@ -359,7 +359,7 @@
        RIGHT-HAND SIDE,  WE WILL USE THE EXISTING L AND U FACTORS IN
        LUSTRUCT OBTAINED FROM A PREVIOUS FATORIZATION.
        ------------------------------------------------------------*/
-    options.Fact = FACTORED; /* Indicate the factored form of A is supplied. */
+    options.Fact = SamePattern_SameRowPerm; /* Indicate the factored form of A is supplied. */
     PStatInit(&stat); /* Initialize the statistics variables. */

Please let me know if you can confirm.

Thanks, Jake

xiaoyeli commented 3 years ago

"SamePattern" is not yet implemented. I'll do that later today. -Sherry

On Thu, Sep 9, 2021 at 9:44 AM Jacob King @.***> wrote:

Thanks! I'm now getting through the first factorization and solve with the 3d code.

However, it seems that there is another problem when doing a subsequent factorization with the same sparsity pattern. Here's the stack trace in the application:

1 0x20c5a90e in pzdistribute

at /global/u1/j/jakeking/nimall-trunk/superlu_dist7/SRC/pzdistribute.c:583

2 0x20c7435f in pzgssvx3d

at /global/u1/j/jakeking/nimall-trunk/superlu_dist7/SRC/pzgssvx3d.c:1192

3 0x20c044a2 in fpzgssvx3d

at /global/u1/j/jakeking/nimall-trunk/superlu_dist7/FORTRAN/superlu_c2f_zwrap.c:161

4 0x20c03220 in __superlu_dist_z_mod_MOD_pzloc_superlu_dist

at /global/u1/j/jakeking/nimall-trunk/nimdevel/extwrappers/superlu_dist_z_mod.F90:163

The issue can also be reproduced in pzdrive3d1 with the following modification:

Index: pzdrive3d1.c

--- pzdrive3d1.c (revision 2105) +++ pzdrive3d1.c (working copy) @@ -359,7 +359,7 @@ RIGHT-HAND SIDE, WE WILL USE THE EXISTING L AND U FACTORS IN LUSTRUCT OBTAINED FROM A PREVIOUS FATORIZATION. ------------------------------------------------------------*/

  • options.Fact = FACTORED; / Indicate the factored form of A is supplied. /
  • options.Fact = SamePattern_SameRowPerm; / Indicate the factored form of A is supplied. / PStatInit(&stat); / Initialize the statistics variables. /

Please let me know if you can confirm.

Thanks, Jake

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/xiaoyeli/superlu_dist/issues/77#issuecomment-916264229, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACZSV5Y7FHE6IRJVQ5HNSVDUBDP7HANCNFSM47EEN2IA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

xiaoyeli commented 3 years ago

@jacobrking Now SamePattern should work, see EXAMPLE/pzdrive3d2.c, still in maint branch.

jacobrking commented 3 years ago

Thanks, Sherry.

Everything is now running through with 4 processors for the 3d layouts that I've tried (nzdep=1, 2 and 4). However, I'm running into trouble with 8 cores. It works with nzdep=1 and 8 but fails during the first solve phase with nzdep=2 and 4 (the factorization succeeds). The stack trace shows the 3d communication code as the culprit:

#7  0x20c765be in zScatter_B3d
    at /global/u1/j/jakeking/nimall-trunk/superlu_dist7/SRC/znrformat_loc3d.c:386
#8  0x20c73a59 in pzgssvx3d
    at /global/u1/j/jakeking/nimall-trunk/superlu_dist7/SRC/pzgssvx3d.c:1588
#9  0x20c03d12 in f_pzgssvx3d_
    at /global/u1/j/jakeking/nimall-trunk/superlu_dist7/FORTRAN/superlu_c2f_zwrap.c:161

I'm having trouble reproducing the error with the examples. When I use my matrix (/project/projectdirs/nimrod/jakeking/cmatrix.bin), I have trouble with the pzdrive3d2 and pzdrive3d3 that are related to closing the file pointer (fclose(fp);) and reading the matrix from file a second time. The next step for me is to make the matrix layout the same as in the application in pzdrive3d1 but other suggestions are welcome.

jacobrking commented 3 years ago

I've confirmed that this is a matrix layout issue. When I modify the ordering to be the same as the application I get the failures with identical layouts (It works with nzdep=1 and 8 but fails during the first solve phase with nzdep=2 and 4). Below is the modification to zcreate_matrix3d.c that I used to change the pzdrive3d1 program. This specifically must be run with the matrix referenced in my last post (/project/projectdirs/nimrod/jakeking/cmatrix.bin).

Index: zcreate_matrix3d.c
===================================================================
--- zcreate_matrix3d.c  (revision 2106)
+++ zcreate_matrix3d.c  (working copy)
@@ -341,17 +341,44 @@
     nzval[0] = 0.1;
 #endif

+    switch(iam) {
+      case 0:
+       m_loc=111; fst_row=0;
+       break;
+      case 1:
+       m_loc=84; fst_row=111;
+       break;
+      case 2:
+       m_loc=108; fst_row=195;
+       break;
+      case 3:
+       m_loc=84; fst_row=303;
+       break;
+      case 4:
+       m_loc=108; fst_row=387;
+       break;
+      case 5:
+       m_loc=84; fst_row=495;
+       break;
+      case 6:
+       m_loc=108; fst_row=579;
+       break;
+      case 7:
+       m_loc=84; fst_row=687;
+       break;
+     }

     /* Create compressed column matrix for GA. */
     zCreate_CompCol_Matrix_dist(&GA, m, n, nnz, nzval, rowind, colptr,
@@ -379,7 +406,7 @@
         for (j = colptr[i]; j < colptr[i + 1]; ++j) ++marker[rowind[j]];
     /* Set up row pointers */
     rowptr[0] = 0;
-    fst_row = iam * m_loc_fst;
+//    fst_row = iam * m_loc_fst;
     nnz_loc = 0;
     for (j = 0; j < m_loc; ++j)
     {
xiaoyeli commented 3 years ago

In the last comments in issue #83, you said that you already fixed the problem?

Sherry

On Mon, Sep 13, 2021 at 3:35 PM Jacob King @.***> wrote:

I've confirmed that this is a matrix layout issue. When I modify the ordering to be the same as the application I get the failures with identical layouts (It works with nzdep=1 and 8 but fails during the first solve phase with nzdep=2 and 4). Below is the modification to zcreate_matrix3d.c that I used to change the pzdrive3d1 program. This specifically must be run with the matrix references in my last post ( /project/projectdirs/nimrod/jakeking/cmatrix.bin).

Index: zcreate_matrix3d.c

--- zcreate_matrix3d.c (revision 2106) +++ zcreate_matrix3d.c (working copy) @@ -341,17 +341,44 @@ nzval[0] = 0.1;

endif

  • switch(iam) {
  • case 0:
  • m_loc=111; fst_row=0;
  • break;
  • case 1:
  • m_loc=84; fst_row=111;
  • break;
  • case 2:
  • m_loc=108; fst_row=195;
  • break;
  • case 3:
  • m_loc=84; fst_row=303;
  • break;
  • case 4:
  • m_loc=108; fst_row=387;
  • break;
  • case 5:
  • m_loc=84; fst_row=495;
  • break;
  • case 6:
  • m_loc=108; fst_row=579;
  • break;
  • case 7:
  • m_loc=84; fst_row=687;
  • break;
  • }

    / Create compressed column matrix for GA. / zCreate_CompCol_Matrix_dist(&GA, m, n, nnz, nzval, rowind, colptr, @@ -379,7 +406,7 @@ for (j = colptr[i]; j < colptr[i + 1]; ++j) ++marker[rowind[j]]; / Set up row pointers / rowptr[0] = 0;

  • fst_row = iam m_loc_fst; +// fst_row = iam m_loc_fst; nnz_loc = 0; for (j = 0; j < m_loc; ++j) {

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/xiaoyeli/superlu_dist/issues/77#issuecomment-918634339, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACZSV563RDINOF4OMBM7C33UBZ4D3ANCNFSM47EEN2IA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

jacobrking commented 3 years ago

I fixed a problem with my debugging code (I forgot the break statements). I tested with 4-core runs and ended up back on this issue. I've now moved to 8-core runs with the patches here and found a matrix-layout problem. We could close this issue and re-open that one if you prefer to be more consistent with the issue naming.

xiaoyeli commented 3 years ago

Do you have time to have a quick phone call tomorrow afternoon? -Sherry

On Mon, Sep 13, 2021 at 9:31 PM Jacob King @.***> wrote:

I fixed a problem with my debugging code (I forgot the break statements). I tested with 4-core runs and ended up back on this issue. I've now moved to 8-core runs with the patches here and found a matrix-layout problem. We could close this issue and re-open that one if you prefer to be more consistent with the issue naming.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/xiaoyeli/superlu_dist/issues/77#issuecomment-918792279, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACZSV5YTJ5M6LEV65WFSKITUB3F2TANCNFSM47EEN2IA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.