yambo-code / yambo

This is the official GPL repository of the yambo code
http://www.yambo-code.eu/
GNU General Public License v2.0

OpenAcc slowdown compared to CudaFortran in X_irredux #79

[Open] sangallidavide opened 2 months ago

sangallidavide commented 2 months ago

Using the 02_eels test of Al111 in the test-suite, CUDA Fortran runs in less than one second, while OpenACC takes about 13 seconds.

Tests done on branch tech/deve-gpu

[timing screenshots attached]

sangallidavide commented 2 months ago

With commit 4cd69a6 the OpenACC timing goes down to about 3 seconds, which is still longer than the CUDA Fortran one.

Attached are the changes of the commit, since it lives in a branch of the private repository: (a) the data present directive is moved outside the freq_loop; (b) collapse(2) is added around the loop that is also reported in the next message. A plain-directive sketch of change (a) is given right after the diff.

diff --git a/src/pol_function/X_irredux.F b/src/pol_function/X_irredux.F
index 13bc7d503e..55ebf0c84b 100644
--- a/src/pol_function/X_irredux.F
+++ b/src/pol_function/X_irredux.F
@@ -312,6 +312,7 @@ subroutine X_irredux(iq,what,X_par,Xen,Xk,Xw,X,Dip)
    !
    ! 3) Finally multiply residual and frequency dependent term
    !===========================================================
+   !DEV_ACC data present(X_par_p,Xo_res_p,X_par_lowtri_p,Xo_res_p)
    freq_loop:&
    do iw=1,Xw%n_freqs
      !
@@ -324,8 +325,7 @@ subroutine X_irredux(iq,what,X_par,Xen,Xk,Xw,X,Dip)
      !
      GreenF_iw=GreenF(iw)
      !
-     !DEV_ACC data present(X_par_p,Xo_res_p)
-     !DEV_ACC parallel loop 
+     !DEV_ACC parallel loop collapse(2)
      !DEV_CUF kernel do(2) <<<*,*>>>
      !DEV_OMPGPU target map(present,alloc:X_par_p,Xo_res_p)
      !DEV_OMPGPU teams loop collapse(2)
@@ -335,7 +335,6 @@ subroutine X_irredux(iq,what,X_par,Xen,Xk,Xw,X,Dip)
        enddo
      enddo
      !DEV_OMPGPU end target
-     !DEV_ACC end data
      !
 #else
      !
@@ -372,7 +371,6 @@ subroutine X_irredux(iq,what,X_par,Xen,Xk,Xw,X,Dip)
        !
        if (PAR_COM_RL_INDEX_n_CPU>1) then
          !
-         !DEV_ACC data present(X_par_lowtri_p,Xo_res_p)
          !DEV_ACC parallel loop collapse(2) 
          !DEV_CUF kernel do(2)
          !DEV_OMPGPU target map(present,alloc:X_par_lowtri_p,Xo_res_p)
@@ -389,11 +387,9 @@ subroutine X_irredux(iq,what,X_par,Xen,Xk,Xw,X,Dip)
          enddo
          enddo
          !DEV_OMPGPU end target
-         !DEV_ACC end data
          !
        else
          !
-         !DEV_ACC data present(X_par_p,Xo_res_p)
          !DEV_ACC parallel loop collapse(2) 
          !DEV_CUF kernel do(2)
          !DEV_OMPGPU target map(present,alloc:X_par_p,Xo_res_p)
@@ -409,7 +405,6 @@ subroutine X_irredux(iq,what,X_par,Xen,Xk,Xw,X,Dip)
          enddo
          enddo
          !DEV_OMPGPU end target
-         !DEV_ACC end data
          !
        endif
 #else
@@ -450,6 +445,7 @@ subroutine X_irredux(iq,what,X_par,Xen,Xk,Xw,X,Dip)
      endif
      !
    enddo freq_loop
+   !DEV_ACC end data
    !
    if (master_thread) call live_timing(steps=1)
    !
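
For readers less familiar with yambo's DEV_* macro layer, change (a) is the standard OpenACC pattern of hoisting the data region out of the frequency loop, so the runtime present lookup is paid once per call instead of once per frequency. A generic sketch in plain !$acc directives (simplified, not the actual yambo source):

     ! Generic sketch of change (a): one data region around the whole frequency loop.
     ! Names mirror the diff above; the kernel body is simplified.
     !$acc data present(X_par_p,Xo_res_p)
     do iw=1,Xw%n_freqs
       GreenF_iw=GreenF(iw)
       !$acc parallel loop collapse(2)
       do ig_col=X_cols1,X_cols2
         do ig1=X_rows1,X_rows2
           if (ig1<=ig_col) X_par_p(ig1,ig_col,iw)=X_par_p(ig1,ig_col,iw)+GreenF_iw*Xo_res_p(ig1,ig_col)
         enddo
       enddo
     enddo
     !$acc end data
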
sangallidavide commented 2 months ago

The source of the residual slowdown is the following loop (src/pol_function/X_irredux.F)

     !DEV_ACC parallel loop collapse(2)                                 
     !DEV_CUF kernel do(2) <<<*,*>>>                                    
     !DEV_OMPGPU target map(present,alloc:X_par_p,Xo_res_p)             
     !DEV_OMPGPU teams loop collapse(2)                                 
     do ig_col=X_cols1,X_cols2                                          
       do ig1=X_rows1,X_rows2                                           
         if (ig1 <= ig_col) X_par_p(ig1,ig_col,iw)=X_par_p(ig1,ig_col,iw)+GreenF_iw*Xo_res_p(ig1,ig_col)
       enddo                                                            
     enddo                                                              
     !DEV_OMPGPU end target                                             

CUDA Fortran is more efficient than OpenACC here.
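
A hedged alternative that could be timed against the masked collapse(2) version (it is not in the yambo sources) is to keep the triangular bound explicit and split the gang/vector levels instead of collapsing, so the masked-out iterations are never scheduled:

     ! Untested sketch: no collapse, triangular inner bound, explicit gang/vector split
     !$acc parallel loop gang present(X_par_p,Xo_res_p)
     do ig_col=X_cols1,X_cols2
       !$acc loop vector
       do ig1=X_rows1,min(ig_col,X_rows2)
         X_par_p(ig1,ig_col,iw)=X_par_p(ig1,ig_col,iw)+GreenF_iw*Xo_res_p(ig1,ig_col)
       enddo
     enddo

Whether this wins depends on the block sizes: short columns leave the vector lanes underutilized.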

sangallidavide commented 2 months ago

A possible solution is to allocate a table of indexes at the beginning of the subroutine:

     i1=0
     do ig_col=X_cols1,X_cols2
       do ig1=X_rows1,min(ig_col,X_rows2)
         i1=i1+1
         gg_table(:,i1)=[ig_col,ig1]
       enddo
     enddo
     ng=i1

and later use it inside the frequency loop

  do i1=1,ng
    ig_col=gg_table(1,i1)
    ig1   =gg_table(2,i1)
    X_par_p(ig1,ig_col,iw)=X_par_p(ig1,ig_col,iw)+GreenF_iw*Xo_res_p(ig1,ig_col)
  enddo
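
Putting the two pieces together, a self-contained sketch of how this could look (gg_table, ng and the X_* names come from the snippets above; the subroutine wrapper, declarations and OpenACC data clauses are assumptions, not the actual yambo code):

 subroutine X_irredux_tri_update_sketch(X_par_p,Xo_res_p,GreenF,X_rows1,X_rows2,X_cols1,X_cols2,n_freqs)
   implicit none
   integer, intent(in)    :: X_rows1,X_rows2,X_cols1,X_cols2,n_freqs
   complex, intent(in)    :: GreenF(n_freqs)
   complex, intent(in)    :: Xo_res_p(X_rows1:X_rows2,X_cols1:X_cols2)
   complex, intent(inout) :: X_par_p(X_rows1:X_rows2,X_cols1:X_cols2,n_freqs)
   integer, allocatable :: gg_table(:,:)
   integer :: i1,ng,ig1,ig_col,iw
   complex :: GreenF_iw
   !
   ! Build the (ig_col,ig1) pairs of the triangular block once, on the host
   allocate(gg_table(2,(X_rows2-X_rows1+1)*(X_cols2-X_cols1+1)))
   i1=0
   do ig_col=X_cols1,X_cols2
     do ig1=X_rows1,min(ig_col,X_rows2)
       i1=i1+1
       gg_table(:,i1)=[ig_col,ig1]
     enddo
   enddo
   ng=i1
   !
   !$acc data copyin(gg_table,Xo_res_p) copy(X_par_p)
   do iw=1,n_freqs
     GreenF_iw=GreenF(iw)
     ! Flat loop: every scheduled iteration does useful work, no masked-out threads
     !$acc parallel loop private(ig1,ig_col)
     do i1=1,ng
       ig_col=gg_table(1,i1)
       ig1   =gg_table(2,i1)
       X_par_p(ig1,ig_col,iw)=X_par_p(ig1,ig_col,iw)+GreenF_iw*Xo_res_p(ig1,ig_col)
     enddo
   enddo
   !$acc end data
   deallocate(gg_table)
 end subroutine X_irredux_tri_update_sketch

The table costs 2*ng integers and one host-to-device copy, but it is built once per call while the flat loop runs for every frequency.
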
andrea-ferretti commented 1 month ago

Looking at the original loop:

     !DEV_ACC parallel loop collapse(2)                                 
     !DEV_CUF kernel do(2) <<<*,*>>>                                    
     !DEV_OMPGPU target map(present,alloc:X_par_p,Xo_res_p)             
     !DEV_OMPGPU teams loop collapse(2)                                 
     do ig_col=X_cols1,X_cols2                                          
       do ig1=X_rows1,X_rows2                                           
         if (ig1 <= ig_col) X_par_p(ig1,ig_col,iw)=X_par_p(ig1,ig_col,iw)+GreenF_iw*Xo_res_p(ig1,ig_col)
       enddo                                                            
     enddo                                                              
     !DEV_OMPGPU end target 

I think this development came with the CUDA Fortran implementation, where instead of having a dependence between the two loops:

     do ig_col=X_cols1,X_cols2                                          
       do ig1=X_rows1,min(ig_col,X_rows2)                                           
            X_par_p(ig1,ig_col,iw)=X_par_p(ig1,ig_col,iw)+GreenF_iw*Xo_res_p(ig1,ig_col)
       enddo                                                            
     enddo  

it was preferred to have two independent loops with an if condition (code above). The same is still done in the CPU part, but there it is handled with the caxpy routine.
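
For reference, the CPU pattern being alluded to maps each column of the triangular block onto one BLAS caxpy call, roughly as below (illustrative fragment, not the actual yambo CPU code; n_el is a local helper variable):

     ! Illustrative only: per-column caxpy over the triangular block
     do ig_col=X_cols1,X_cols2
       n_el=min(ig_col,X_rows2)-X_rows1+1
       if (n_el>0) call caxpy(n_el,GreenF_iw,Xo_res_p(X_rows1,ig_col),1,X_par_p(X_rows1,ig_col,iw),1)
     enddo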

Since this is performance critical, one could even think of keeping two different versions, one for CUDA Fortran and one for OpenACC.
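
If the two flavours are kept separate, a minimal sketch of what that could look like, using the compiler-defined _CUDA and _OPENACC symbols as stand-ins for whatever configure-time guards yambo actually uses (in practice this would go through the DEV_* macro layer):

#if defined _CUDA
     ! CUDA Fortran flavour: masked rectangular loop, maps directly onto a 2D grid
     !$cuf kernel do(2) <<<*,*>>>
     do ig_col=X_cols1,X_cols2
       do ig1=X_rows1,X_rows2
         if (ig1<=ig_col) X_par_p(ig1,ig_col,iw)=X_par_p(ig1,ig_col,iw)+GreenF_iw*Xo_res_p(ig1,ig_col)
       enddo
     enddo
#elif defined _OPENACC
     ! OpenACC flavour: flat loop over the precomputed index table proposed above
     !$acc parallel loop private(ig1,ig_col) present(X_par_p,Xo_res_p,gg_table)
     do i1=1,ng
       ig_col=gg_table(1,i1)
       ig1   =gg_table(2,i1)
       X_par_p(ig1,ig_col,iw)=X_par_p(ig1,ig_col,iw)+GreenF_iw*Xo_res_p(ig1,ig_col)
     enddo
#endif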