Optimize 3D FFT

Several ideas could help optimize the 3D FFT using modern FFT libraries 
(such as FFTW or clAmdFft).

1) The full 3D transform can be performed (including in MPI mode) using the 
built-in functionality of FFTW. This is conceptually very simple and lets the 
library apply its full optimization power. However, it increases the 
computational load almost twofold, due to redundant transformations of zeros. 
See http://www.fftw.org/pruned.html . In particular, there is a beta version 
of sparsefft that addresses this problem, but it lacks MPI support.
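
The overhead from transforming zeros can be estimated with a rough operation 
count. The sketch below is only an illustration under stated assumptions: the 
data are assumed to occupy one octant of a grid doubled along every axis (not 
stated above), each 1D FFT of length L is assumed to cost L*log2(L), and the 
pruned variant is assumed to skip only lines that are entirely zero on input. 
The function names are hypothetical.

```python
import math

def fft_line_cost(length):
    """Rough cost model for one 1D FFT of the given length (assumption)."""
    return length * math.log2(length)

def padded_3d_costs(nx, ny, nz):
    """Return (full, pruned) cost estimates for an nx*ny*nz block that is
    zero-padded to a (2nx)*(2ny)*(2nz) grid (assumed layout)."""
    gx, gy, gz = 2 * nx, 2 * ny, 2 * nz
    # Full transform: every line along every axis is transformed.
    full = (gy * gz * fft_line_cost(gx)
            + gx * gz * fft_line_cost(gy)
            + gx * gy * fft_line_cost(gz))
    # Pruned transform: along x only ny*nz lines are non-zero, along y only
    # gx*nz lines are non-zero, along z all gx*gy lines are needed.
    pruned = (ny * nz * fft_line_cost(gx)
              + gx * nz * fft_line_cost(gy)
              + gx * gy * fft_line_cost(gz))
    return full, pruned

full, pruned = padded_3d_costs(64, 64, 64)
ratio = full / pruned  # 12/7 for a cubic block under this model
```

Under this simple model a cubic block gives a ratio of 12/7 (about 1.7), which 
is of the same order as the "almost two times" estimate above. Real timings 
also depend on memory access and library internals, which the model ignores.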

This is also possible for clAmdFft, but there the total FFT size (product of 
dimensions) is currently limited to 2^24, which can be an issue with modern 
GPUs.
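
A quick way to see when this limit bites is to compare the product of the grid 
dimensions against it. The helper below is a hypothetical illustration, not 
part of clAmdFft; whether the library accepts a size of exactly 2^24 is not 
stated above, so treating the bound as inclusive is an assumption.

```python
import math

# Limit quoted in the text; inclusive comparison is an assumption.
MAX_TOTAL_POINTS = 2 ** 24

def fits_size_limit(dims, limit=MAX_TOTAL_POINTS):
    """Return True if the product of the FFT dimensions is within the limit."""
    return math.prod(dims) <= limit

small_ok = fits_size_limit((128, 128, 128))   # 2^21 points
large_ok = fits_size_limit((512, 512, 512))   # 2^27 points
```

With these numbers a 128^3 grid stays well within the bound, while a 512^3 
grid exceeds it by a factor of 8.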

2) The 2D transforms (in slices) can be done in one go (with the same 
trade-offs as described above). This is especially promising for clAmdFft, 
since there we already effectively perform full 2D FFTs (without regard for 
zeros) due to the limited flexibility of the library.
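
The trade-off for a single slice can be illustrated with a toy example. The 
sketch below uses a naive O(n^2) DFT from the standard library as a stand-in 
for a library FFT call, and assumes the non-zero data sit in the first rows of 
a zero-padded slice. All names are hypothetical. It shows that skipping the 
all-zero rows gives the same result as a full 2D transform while using fewer 
1D transforms.

```python
import cmath

def dft(line):
    """Naive 1D DFT; stands in for a library FFT call."""
    n = len(line)
    return [sum(line[j] * cmath.exp(-2j * cmath.pi * j * k / n)
                for j in range(n))
            for k in range(n)]

def dft2_full(grid):
    """Transform every row, then every column, zeros included."""
    rows = [dft(r) for r in grid]
    cols = [dft(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]

def dft2_pruned(grid, nonzero_rows):
    """Skip row transforms for rows known to be all zero, then do all columns."""
    width = len(grid[0])
    rows = [dft(r) if i < nonzero_rows else [0j] * width
            for i, r in enumerate(grid)]
    cols = [dft(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]

# 2x2 data zero-padded to a 4x4 slice.
grid = [[1, 2, 0, 0],
        [3, 4, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]]
full = dft2_full(grid)
pruned = dft2_pruned(grid, nonzero_rows=2)
max_diff = max(abs(full[i][j] - pruned[i][j])
               for i in range(4) for j in range(4))
```

In this 4x4 example the full version performs 8 one-dimensional transforms and 
the pruned one performs 6; doing the slice in one library call gives up that 
saving in exchange for letting the library optimize the whole 2D transform.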

3) MPI transposes (global communications) can be performed by FFTW. Local 
transposes (inside slices) can also be done by FFTW; see 
http://www.fftw.org/faq/section3.html#transpose . Something similar can 
probably be devised for clAmdFft.
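
For reference, the index mapping that a local transpose performs on a flat, 
row-major slice can be sketched as follows. This is a plain out-of-place loop 
with a hypothetical function name; it does not use or reproduce the FFTW 
interface, and only shows what a library transpose would have to compute.

```python
def transpose_flat(data, rows, cols):
    """Transpose a rows x cols matrix stored row-major in a flat list."""
    out = [None] * (rows * cols)
    for i in range(rows):
        for j in range(cols):
            # Element (i, j) of the input becomes element (j, i) of the output.
            out[j * rows + i] = data[i * cols + j]
    return out

flat = [1, 2, 3,
        4, 5, 6]              # 2 x 3 matrix
transposed = transpose_flat(flat, rows=2, cols=3)   # 3 x 2 matrix
```

A library routine would typically do this in place and with cache-friendly 
blocking, which is the main reason to delegate it.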

4) It is also possible to significantly optimize the non-FFT parts by 
restructuring arrays (especially inside slices). For instance, different 
components can be made contiguous. This will, however, make the FFTs "strided".
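
What "strided" means here can be shown on a flat array. The sketch below 
assumes that making components contiguous means storing the three components 
of each grid point next to each other (an interpretation, not stated 
explicitly above); the function names are hypothetical. An FFT over one 
component then has to step through memory with a stride of 3.

```python
def interleave(xs, ys, zs):
    """Store the components of each point contiguously: x0 y0 z0 x1 y1 z1 ..."""
    flat = []
    for x, y, z in zip(xs, ys, zs):
        flat.extend((x, y, z))
    return flat

def component(flat, c, ncomp=3):
    """Extract component c; this is the stride-ncomp access an FFT would need."""
    return flat[c::ncomp]

flat = interleave([1, 2], [3, 4], [5, 6])
y_values = component(flat, 1)
```

Point-wise operations that touch all three components of a point read 
adjacent memory in this layout, while each per-component FFT reads every third 
element.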

Original issue reported on code.google.com by yurkin on 21 Jan 2013 at 6:44
wsy2220 / a-dda

Optimize 3D FFT #158