Several ideas could be used to optimize the 3D FFT using modern FFT
libraries (like FFTW or clAmdFft).
1) The full 3D transform can be performed (including in MPI mode) using the
built-in functionality of FFTW. This is conceptually very simple and lets the
library apply its full optimization power. However, it will increase the
computational load by almost a factor of two, due to redundant transforms of
zeros. See http://www.fftw.org/pruned.html . In particular, there is a beta
version of sparsefft that addresses this problem, but it doesn't have MPI
support. This is also possible with clAmdFft, but there the total FFT size
(product of dimensions) is currently limited to 2^24, which can be an issue
with modern GPUs.
2) 2D transforms (in slices) can be done in one go (with the same trade-offs
as described above). This is especially promising for clAmdFft, since there we
are currently effectively performing the full 2D FFTs (without regard for
zeros) due to the limited flexibility of the library.
3) MPI transposes (global communications) can be done by FFTW. Local
transposes (inside slices) can also be done by FFTW, see
http://www.fftw.org/faq/section3.html#transpose . Probably, something similar
can be devised for clAmdFft.
4) It is also possible to significantly optimize the non-FFT parts by
restructuring arrays (especially inside slices). For instance, different
components can be made contiguous. This will, however, make the FFTs "strided".
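
To put a number on the "almost two times" in idea 1, here is a rough sketch in C. It assumes (this is not stated above) that the data occupy an nx*ny*nz box inside a zero-padded grid doubled along each axis, and that a pruned scheme skips every 1D line known to contain only zeros. Cost is counted as the number of points passed through 1D FFTs; the logarithmic factor is ignored, which is exact for cubic grids. All function names are hypothetical, and the 2^24 limit is treated as inclusive, which is also an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Points processed when the whole padded grid (2nx x 2ny x 2nz) is
   transformed along x, y and z: every pass touches every point. */
uint64_t full_points(uint64_t nx, uint64_t ny, uint64_t nz)
{
    assert(nx > 0 && ny > 0 && nz > 0);
    uint64_t gx = 2 * nx, gy = 2 * ny, gz = 2 * nz;
    return 3 * gx * gy * gz;
}

/* Points processed when all-zero lines are skipped (assumed scheme):
   x pass: only ny*nz lines of length gx hold data;
   y pass: x is now filled, z is not, so gx*nz lines of length gy;
   z pass: all gx*gy lines of length gz. */
uint64_t pruned_points(uint64_t nx, uint64_t ny, uint64_t nz)
{
    assert(nx > 0 && ny > 0 && nz > 0);
    uint64_t gx = 2 * nx, gy = 2 * ny, gz = 2 * nz;
    return gx * ny * nz + gx * gy * nz + gx * gy * gz;
}

/* Ratio of the two estimates above. */
double cost_ratio(uint64_t nx, uint64_t ny, uint64_t nz)
{
    return (double)full_points(nx, ny, nz)
         / (double)pruned_points(nx, ny, nz);
}

/* Returns 1 if the padded grid fits the total-size limit quoted for
   clAmdFft (product of dimensions at most 2^24), 0 otherwise. */
int fits_size_limit(uint64_t nx, uint64_t ny, uint64_t nz)
{
    const uint64_t limit = (uint64_t)1 << 24;
    return (2 * nx) * (2 * ny) * (2 * nz) <= limit;
}
```

For a 64x64x64 box this gives 6291456 versus 3670016 points, a ratio of 12/7 (about 1.71), consistent with "almost two times". Under the same assumptions, a 256x256x256 box (padded to 512^3 = 2^27) would not fit into the clAmdFft size limit.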
Original issue reported on code.google.com by yurkin on 21 Jan 2013 at 6:44