Open jacobrking opened 3 years ago
In "maint" branch, I restructured driver pxgssvx3d to allow separate calls to Factorization and Solve. The A3d structure is persistent between multiple calls.
Please give a try.
Hi Sherry,
I've tested the maint branch and it is better. Thank you.
I've run into a new issue that I believe you can reproduce with a simple modification to the example pzdrive3d1
. When I set nrhs=0
on the initial call to pzgssvx3d
(so the first call only is for factorization only) the solution on the second call is no longer correct. Can you confirm this?
Index: EXAMPLE/pzdrive3d1.c
===================================================================
--- EXAMPLE/pzdrive3d1.c (revision 2100)
+++ EXAMPLE/pzdrive3d1.c (working copy)
@@ -340,7 +340,7 @@
PStatInit (&stat);
/* Call the linear equation solver. */
- pzgssvx3d (&options, &A, &ScalePermstruct, b, ldb, nrhs, &grid,
+ pzgssvx3d (&options, &A, &ScalePermstruct, b, ldb, 0, &grid,
&LUstruct, &SOLVEstruct, berr, &stat, &info);
Thanks, Jake
I should also mention that I needed to make this minor patch:
Index: FORTRAN/superlu_c2f_dwrap.c
===================================================================
--- FORTRAN/superlu_c2f_dwrap.c (revision 2100)
+++ FORTRAN/superlu_c2f_dwrap.c (working copy)
@@ -95,7 +95,7 @@
dLUstructFree(LUstruct_ptr);
}
-void f_dDestroy_A3d_gathered_on_2d(fptr *SOLVEstruct, fptr *grid)
+void f_dDestroy_A3d_gathered_on_2d(fptr *SOLVEstruct, fptr *grid3d)
{
dDestroy_A3d_gathered_on_2d((dSOLVEstruct_t *) *SOLVEstruct,
(gridinfo3d_t *) *grid3d);
Index: FORTRAN/superlu_c2f_zwrap.c
===================================================================
--- FORTRAN/superlu_c2f_zwrap.c (revision 2100)
+++ FORTRAN/superlu_c2f_zwrap.c (working copy)
@@ -94,7 +94,7 @@
zLUstructFree(LUstruct_ptr);
}
-void f_zDestroy_A3d_gathered_on_2d(fptr *SOLVEstruct, fptr *grid)
+void f_zDestroy_A3d_gathered_on_2d(fptr *SOLVEstruct, fptr *grid3d)
{
zDestroy_A3d_gathered_on_2d((zSOLVEstruct_t *) *SOLVEstruct,
(gridinfo3d_t *) *grid3d);
I fixed "nrhs==0" bug, and applied the patch in FORTRAN/ directory. Updated 'maint' branch.
Thanks! I'm now getting through the first factorization and solve with the 3d code.
However, it seems that there is another problem when doing a subsequent factorization with the same sparsity pattern. Here's the stack trace in the application:
#1 0x20c5a90e in pzdistribute
at /global/u1/j/jakeking/nimall-trunk/superlu_dist7/SRC/pzdistribute.c:583
#2 0x20c7435f in pzgssvx3d
at /global/u1/j/jakeking/nimall-trunk/superlu_dist7/SRC/pzgssvx3d.c:1192
#3 0x20c044a2 in f_pzgssvx3d_
at /global/u1/j/jakeking/nimall-trunk/superlu_dist7/FORTRAN/superlu_c2f_zwrap.c:161
#4 0x20c03220 in __superlu_dist_z_mod_MOD_pzloc_superlu_dist
at /global/u1/j/jakeking/nimall-trunk/nimdevel/extwrappers/superlu_dist_z_mod.F90:163
The issue can also be reproduced in pzdrive3d1
with the following modification:
Index: pzdrive3d1.c
===================================================================
--- pzdrive3d1.c (revision 2105)
+++ pzdrive3d1.c (working copy)
@@ -359,7 +359,7 @@
RIGHT-HAND SIDE, WE WILL USE THE EXISTING L AND U FACTORS IN
LUSTRUCT OBTAINED FROM A PREVIOUS FATORIZATION.
------------------------------------------------------------*/
- options.Fact = FACTORED; /* Indicate the factored form of A is supplied. */
+ options.Fact = SamePattern_SameRowPerm; /* Indicate the factored form of A is supplied. */
PStatInit(&stat); /* Initialize the statistics variables. */
Please let me know if you can confirm.
Thanks, Jake
"SamePattern" is not yet implemented. I'll do that later today. -Sherry
On Thu, Sep 9, 2021 at 9:44 AM Jacob King @.***> wrote:
Thanks! I'm now getting through the first factorization and solve with the 3d code.
However, it seems that there is another problem when doing a subsequent factorization with the same sparsity pattern. Here's the stack trace in the application:
1 0x20c5a90e in pzdistribute
at /global/u1/j/jakeking/nimall-trunk/superlu_dist7/SRC/pzdistribute.c:583
2 0x20c7435f in pzgssvx3d
at /global/u1/j/jakeking/nimall-trunk/superlu_dist7/SRC/pzgssvx3d.c:1192
3 0x20c044a2 in fpzgssvx3d
at /global/u1/j/jakeking/nimall-trunk/superlu_dist7/FORTRAN/superlu_c2f_zwrap.c:161
4 0x20c03220 in __superlu_dist_z_mod_MOD_pzloc_superlu_dist
at /global/u1/j/jakeking/nimall-trunk/nimdevel/extwrappers/superlu_dist_z_mod.F90:163
The issue can also be reproduced in pzdrive3d1 with the following modification:
Index: pzdrive3d1.c
--- pzdrive3d1.c (revision 2105) +++ pzdrive3d1.c (working copy) @@ -359,7 +359,7 @@ RIGHT-HAND SIDE, WE WILL USE THE EXISTING L AND U FACTORS IN LUSTRUCT OBTAINED FROM A PREVIOUS FATORIZATION. ------------------------------------------------------------*/
- options.Fact = FACTORED; / Indicate the factored form of A is supplied. /
- options.Fact = SamePattern_SameRowPerm; / Indicate the factored form of A is supplied. / PStatInit(&stat); / Initialize the statistics variables. /
Please let me know if you can confirm.
Thanks, Jake
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/xiaoyeli/superlu_dist/issues/77#issuecomment-916264229, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACZSV5Y7FHE6IRJVQ5HNSVDUBDP7HANCNFSM47EEN2IA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.
@jacobrking Now SamePattern should work, see EXAMPLE/pzdrive3d2.c, still in maint branch.
Thanks, Sherry.
Everything is now running through with 4 processors for the 3d layouts that I've tried (nzdep=1, 2 and 4). However, I'm running into trouble with 8 cores. It works with nzdep=1 and 8 but fails during the first solve phase with nzdep=2 and 4 (the factorization succeeds). The stack trace shows the 3d communication code as the culprit:
#7 0x20c765be in zScatter_B3d
at /global/u1/j/jakeking/nimall-trunk/superlu_dist7/SRC/znrformat_loc3d.c:386
#8 0x20c73a59 in pzgssvx3d
at /global/u1/j/jakeking/nimall-trunk/superlu_dist7/SRC/pzgssvx3d.c:1588
#9 0x20c03d12 in f_pzgssvx3d_
at /global/u1/j/jakeking/nimall-trunk/superlu_dist7/FORTRAN/superlu_c2f_zwrap.c:161
I'm having trouble reproducing the error with the examples. When I use my matrix (/project/projectdirs/nimrod/jakeking/cmatrix.bin
), I have trouble with the pzdrive3d2 and pzdrive3d3 that are related to closing the file pointer (fclose(fp);
) and reading the matrix from file a second time. The next step for me is to make the matrix layout the same as in the application in pzdrive3d1
but other suggestions are welcome.
I've confirmed that this is a matrix layout issue. When I modify the ordering to be the same as the application I get the failures with identical layouts (It works with nzdep=1 and 8 but fails during the first solve phase with nzdep=2 and 4). Below is the modification to zcreate_matrix3d.c
that I used to change the pzdrive3d1
program. This specifically must be run with the matrix referenced in my last post (/project/projectdirs/nimrod/jakeking/cmatrix.bin
).
Index: zcreate_matrix3d.c
===================================================================
--- zcreate_matrix3d.c (revision 2106)
+++ zcreate_matrix3d.c (working copy)
@@ -341,17 +341,44 @@
nzval[0] = 0.1;
#endif
+ switch(iam) {
+ case 0:
+ m_loc=111; fst_row=0;
+ break;
+ case 1:
+ m_loc=84; fst_row=111;
+ break;
+ case 2:
+ m_loc=108; fst_row=195;
+ break;
+ case 3:
+ m_loc=84; fst_row=303;
+ break;
+ case 4:
+ m_loc=108; fst_row=387;
+ break;
+ case 5:
+ m_loc=84; fst_row=495;
+ break;
+ case 6:
+ m_loc=108; fst_row=579;
+ break;
+ case 7:
+ m_loc=84; fst_row=687;
+ break;
+ }
/* Create compressed column matrix for GA. */
zCreate_CompCol_Matrix_dist(&GA, m, n, nnz, nzval, rowind, colptr,
@@ -379,7 +406,7 @@
for (j = colptr[i]; j < colptr[i + 1]; ++j) ++marker[rowind[j]];
/* Set up row pointers */
rowptr[0] = 0;
- fst_row = iam * m_loc_fst;
+// fst_row = iam * m_loc_fst;
nnz_loc = 0;
for (j = 0; j < m_loc; ++j)
{
In the last comments in issue #83, you said that you already fixed the problem?
Sherry
On Mon, Sep 13, 2021 at 3:35 PM Jacob King @.***> wrote:
I've confirmed that this is a matrix layout issue. When I modify the ordering to be the same as the application I get the failures with identical layouts (It works with nzdep=1 and 8 but fails during the first solve phase with nzdep=2 and 4). Below is the modification to zcreate_matrix3d.c that I used to change the pzdrive3d1 program. This specifically must be run with the matrix references in my last post ( /project/projectdirs/nimrod/jakeking/cmatrix.bin).
Index: zcreate_matrix3d.c
--- zcreate_matrix3d.c (revision 2106) +++ zcreate_matrix3d.c (working copy) @@ -341,17 +341,44 @@ nzval[0] = 0.1;
endif
- switch(iam) {
- case 0:
- m_loc=111; fst_row=0;
- break;
- case 1:
- m_loc=84; fst_row=111;
- break;
- case 2:
- m_loc=108; fst_row=195;
- break;
- case 3:
- m_loc=84; fst_row=303;
- break;
- case 4:
- m_loc=108; fst_row=387;
- break;
- case 5:
- m_loc=84; fst_row=495;
- break;
- case 6:
- m_loc=108; fst_row=579;
- break;
- case 7:
- m_loc=84; fst_row=687;
- break;
}
/ Create compressed column matrix for GA. / zCreate_CompCol_Matrix_dist(&GA, m, n, nnz, nzval, rowind, colptr, @@ -379,7 +406,7 @@ for (j = colptr[i]; j < colptr[i + 1]; ++j) ++marker[rowind[j]]; / Set up row pointers / rowptr[0] = 0;
- fst_row = iam m_loc_fst; +// fst_row = iam m_loc_fst; nnz_loc = 0; for (j = 0; j < m_loc; ++j) {
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/xiaoyeli/superlu_dist/issues/77#issuecomment-918634339, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACZSV563RDINOF4OMBM7C33UBZ4D3ANCNFSM47EEN2IA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.
I fixed a problem with my debugging code (I forgot the break statements). I tested with 4-core runs and ended up back on this issue. I've now moved to 8-core runs with the patches here and found a matrix-layout problem. We could close this issue and re-open that one if you prefer to be more consistent with the issue naming.
Do you have time to have a quick phone call tomorrow afternoon? -Sherry
On Mon, Sep 13, 2021 at 9:31 PM Jacob King @.***> wrote:
I fixed a problem with my debugging code (I forgot the break statements). I tested with 4-core runs and ended up back on this issue. I've now moved to 8-core runs with the patches here and found a matrix-layout problem. We could close this issue and re-open that one if you prefer to be more consistent with the issue naming.
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/xiaoyeli/superlu_dist/issues/77#issuecomment-918792279, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACZSV5YTJ5M6LEV65WFSKITUB3F2TANCNFSM47EEN2IA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.
Hi,
I don't think the factorization and solve phases can be separated in the 3d code in the same way that it can be done in the 2D code. In my application, we call
pzgssvx3d()
in two separate calls: first withFact=DOFACT
to only do factorization and then later withFact=FACTORED
to solve the system (potentially multiple times). On the second call we do not pass the matrix values. This causes the code to crash when callingzGatherNRformat_loc3d()
insidepzgssvx3d()
. The solution seems simple - skip the call tozGatherNRformat_loc3d()
@ line 562 inSRC/pzgssvx3d.c
whenFact=FACTORED
but one complication is that A3d is needed to scatter the solution back to the 3D grid.SRC/pzgssvx3d.c
@ line 1565Here are the relevant lines from a stack trace that show the problem:
Thanks.