sangallidavide opened this issue 2 months ago
With commit 4cd69a6 the timing of the OpenACC version goes down to 3 seconds, still longer than CUDA Fortran.
Attached are the changes of the commit, since it lives in a branch of the private repo:
a) the data present region was moved outside the freq_loop;
b) collapse(2) was added around the loop, also reported in the next message.
diff --git a/src/pol_function/X_irredux.F b/src/pol_function/X_irredux.F
index 13bc7d503e..55ebf0c84b 100644
--- a/src/pol_function/X_irredux.F
+++ b/src/pol_function/X_irredux.F
@@ -312,6 +312,7 @@ subroutine X_irredux(iq,what,X_par,Xen,Xk,Xw,X,Dip)
!
! 3) Finally multiply residual and frequency dependent term
!===========================================================
+ !DEV_ACC data present(X_par_p,Xo_res_p,X_par_lowtri_p)
freq_loop:&
do iw=1,Xw%n_freqs
!
@@ -324,8 +325,7 @@ subroutine X_irredux(iq,what,X_par,Xen,Xk,Xw,X,Dip)
!
GreenF_iw=GreenF(iw)
!
- !DEV_ACC data present(X_par_p,Xo_res_p)
- !DEV_ACC parallel loop
+ !DEV_ACC parallel loop collapse(2)
!DEV_CUF kernel do(2) <<<*,*>>>
!DEV_OMPGPU target map(present,alloc:X_par_p,Xo_res_p)
!DEV_OMPGPU teams loop collapse(2)
@@ -335,7 +335,6 @@ subroutine X_irredux(iq,what,X_par,Xen,Xk,Xw,X,Dip)
enddo
enddo
!DEV_OMPGPU end target
- !DEV_ACC end data
!
#else
!
@@ -372,7 +371,6 @@ subroutine X_irredux(iq,what,X_par,Xen,Xk,Xw,X,Dip)
!
if (PAR_COM_RL_INDEX_n_CPU>1) then
!
- !DEV_ACC data present(X_par_lowtri_p,Xo_res_p)
!DEV_ACC parallel loop collapse(2)
!DEV_CUF kernel do(2)
!DEV_OMPGPU target map(present,alloc:X_par_lowtri_p,Xo_res_p)
@@ -389,11 +387,9 @@ subroutine X_irredux(iq,what,X_par,Xen,Xk,Xw,X,Dip)
enddo
enddo
!DEV_OMPGPU end target
- !DEV_ACC end data
!
else
!
- !DEV_ACC data present(X_par_p,Xo_res_p)
!DEV_ACC parallel loop collapse(2)
!DEV_CUF kernel do(2)
!DEV_OMPGPU target map(present,alloc:X_par_p,Xo_res_p)
@@ -409,7 +405,6 @@ subroutine X_irredux(iq,what,X_par,Xen,Xk,Xw,X,Dip)
enddo
enddo
!DEV_OMPGPU end target
- !DEV_ACC end data
!
endif
#else
@@ -450,6 +445,7 @@ subroutine X_irredux(iq,what,X_par,Xen,Xk,Xw,X,Dip)
endif
!
enddo freq_loop
+ !DEV_ACC end data
!
if (master_thread) call live_timing(steps=1)
!
The source of the residual slowdown is the following loop (src/pol_function/X_irredux.F):
!DEV_ACC parallel loop collapse(2)
!DEV_CUF kernel do(2) <<<*,*>>>
!DEV_OMPGPU target map(present,alloc:X_par_p,Xo_res_p)
!DEV_OMPGPU teams loop collapse(2)
do ig_col=X_cols1,X_cols2
do ig1=X_rows1,X_rows2
if (ig1 <= ig_col) X_par_p(ig1,ig_col,iw)=X_par_p(ig1,ig_col,iw)+GreenF_iw*Xo_res_p(ig1,ig_col)
enddo
enddo
!DEV_OMPGPU end target
CUDA Fortran is more efficient than OpenACC here.
A possible solution is to allocate a table of indexes at the beginning of the subroutine:
i1=0
do ig_col=X_cols1,X_cols2
  do ig1=X_rows1,min(ig_col,X_rows2)
    i1=i1+1
    gg_table(:,i1)=[ig_col,ig1]
  enddo
enddo
ng=i1
and later use it inside the frequency loop:
do i1=1,ng
  ig_col=gg_table(1,i1)
  ig1   =gg_table(2,i1)
  X_par_p(ig1,ig_col,iw)=X_par_p(ig1,ig_col,iw)+GreenF_iw*Xo_res_p(ig1,ig_col)
enddo
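Putting the two pieces together, a minimal self-contained sketch of the index-table idea (array sizes, data values, and the simplified 2D X_par are illustrative assumptions, not Yambo's actual declarations):

```fortran
! Minimal self-contained sketch of the index-table idea above.
! Bounds and values are illustrative assumptions, not Yambo's.
program triangular_index_table
  implicit none
  integer, parameter :: X_rows1=1, X_rows2=4, X_cols1=1, X_cols2=4
  integer :: gg_table(2,(X_rows2-X_rows1+1)*(X_cols2-X_cols1+1))
  integer :: i1, ng, ig1, ig_col
  complex :: X_par(X_rows2,X_cols2), Xo_res(X_rows2,X_cols2), GreenF_iw
  !
  X_par    =(0.0,0.0)
  Xo_res   =(1.0,0.0)
  GreenF_iw=(2.0,0.0)
  !
  ! Build the table once, at the beginning of the subroutine
  i1=0
  do ig_col=X_cols1,X_cols2
    do ig1=X_rows1,min(ig_col,X_rows2)
      i1=i1+1
      gg_table(:,i1)=[ig_col,ig1]
    enddo
  enddo
  ng=i1
  !
  ! Inside the frequency loop: one flat, branch-free loop over ng entries
  do i1=1,ng
    ig_col=gg_table(1,i1)
    ig1   =gg_table(2,i1)
    X_par(ig1,ig_col)=X_par(ig1,ig_col)+GreenF_iw*Xo_res(ig1,ig_col)
  enddo
  !
  print *,'entries updated:',ng
end program triangular_index_table
```

Since the triangle covers roughly half of the collapsed iteration space, this also halves the number of GPU threads that do no work.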
Looking at the original loop:
!DEV_ACC parallel loop collapse(2)
!DEV_CUF kernel do(2) <<<*,*>>>
!DEV_OMPGPU target map(present,alloc:X_par_p,Xo_res_p)
!DEV_OMPGPU teams loop collapse(2)
do ig_col=X_cols1,X_cols2
do ig1=X_rows1,X_rows2
if (ig1 <= ig_col) X_par_p(ig1,ig_col,iw)=X_par_p(ig1,ig_col,iw)+GreenF_iw*Xo_res_p(ig1,ig_col)
enddo
enddo
!DEV_OMPGPU end target
I think this development came with the CUDA Fortran implementation: instead of having a dependence between the bounds of the two loops,
do ig_col=X_cols1,X_cols2
do ig1=X_rows1,min(ig_col,X_rows2)
X_par_p(ig1,ig_col,iw)=X_par_p(ig1,ig_col,iw)+GreenF_iw*Xo_res_p(ig1,ig_col)
enddo
enddo
it was preferred to have two independent loops with an if condition (code above). The same triangular update is still done in the CPU part, but there it is handled via the caxpy routine.
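For reference, the CPU-side pattern would look roughly like one caxpy (the BLAS complex axpy, cy = ca*cx + cy) per column over the triangular part. This is a sketch, not the actual Yambo code; bounds and data are assumptions, and a tiny local stand-in replaces the linked BLAS caxpy so the example is self-contained:

```fortran
program triangular_caxpy
  implicit none
  integer, parameter :: X_rows1=1, X_rows2=4, X_cols1=1, X_cols2=4
  integer :: ig_col, n_rows
  complex :: X_par(X_rows2,X_cols2), Xo_res(X_rows2,X_cols2), GreenF_iw
  !
  X_par=(0.0,0.0); Xo_res=(1.0,0.0); GreenF_iw=(2.0,0.0)
  !
  ! One caxpy per column covers exactly the triangular part,
  ! with no if-branch in an inner loop
  do ig_col=X_cols1,X_cols2
    n_rows=min(ig_col,X_rows2)-X_rows1+1
    if (n_rows>0) call caxpy(n_rows,GreenF_iw,Xo_res(X_rows1:,ig_col),1, &
                             X_par(X_rows1:,ig_col),1)
  enddo
  print *,'first column entry:',X_par(X_rows1,X_cols1)
contains
  subroutine caxpy(n,ca,cx,incx,cy,incy)
    ! Local reference implementation of BLAS caxpy: cy = ca*cx + cy
    integer, intent(in)    :: n, incx, incy
    complex, intent(in)    :: ca, cx(*)
    complex, intent(inout) :: cy(*)
    integer :: i
    do i=1,n
      cy(1+(i-1)*incy)=cy(1+(i-1)*incy)+ca*cx(1+(i-1)*incx)
    enddo
  end subroutine caxpy
end program triangular_caxpy
```

In a real build the local stand-in would be dropped and the call resolved by the linked BLAS library.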
Since this is performance critical, one could even think of keeping different versions for CUDA Fortran and OpenACC.
Using the 02_eels test of Al111 in the test-suite, CUDA Fortran runs in less than one second, while OpenACC takes about 13 seconds.
Tests done on branch tech/deve-gpu