Thanks for the feedback. You make some very interesting points.
Whilst I don't recall having that specific goal of an
'easily-write-a-crappy-GPGPU-implementation framework' :) I do concede that
some effort needs to be made on what you call an 'Aparapi Programming Guide';
it would help immensely. Often I find that developers 'assume' how Aparapi
works and end up coding 'anti-patterns'. We should work harder on this.
I don't agree with #2 at all. We consciously traded potential performance gains
for simplicity, and for the ability to code once for the GPU or for
multi-threaded Java as a fallback. We did this (as you noted) by deliberately
avoiding exposing architectural complexities. I note that when we did attempt
to expose some features for performance (local memory comes to mind), we
marginally improved performance on the GPU (relative to using global memory
only) at the cost of a severe performance penalty when we fall back to JTP
(multi-threaded mode).
Also which particular architectural complexities would you like us to expose to
Java developers? I would be interested in hearing your suggestions.
My POV is that as technology moves ahead, some of the current Aparapi
restrictions (access to the Java heap being the big one; lack of support for
simple Java types such as Strings and boxed Integers being others) will
disappear or become less conspicuous.
Take a look at some of the examples in the proposed lambda/HSA tree. Here we
can directly use Strings on the Java heap and use the Java 8 stream programming
model. Some of this can be emulated from JCuda, JOpenCL, JOCL or OpenCL
versions of Aparapi but HSA enabled Aparapi will be able to do this directly
and efficiently.
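To make the stream-programming point concrete, here is a minimal sketch in plain Java 8 (no Aparapi or HSA required; the data and lambda are illustrative) of the kind of data-parallel work over heap-resident Strings that the lambda/HSA tree aims to offload:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class StreamSketch {
    public static void main(String[] args) {
        List<String> names = Arrays.asList("aparapi", "sumatra", "hsa");

        // A data-parallel transform over heap-resident Strings. On an
        // HSA-enabled JVM the same lambda could be dispatched to the GPU;
        // here it simply runs on the fork/join common pool.
        List<String> upper = names.parallelStream()
                                  .map(String::toUpperCase)
                                  .collect(Collectors.toList());

        System.out.println(upper); // [APARAPI, SUMATRA, HSA]
    }
}
```

Note that `collect(Collectors.toList())` preserves encounter order even for a parallel stream, which is what lets the programming model stay oblivious to where the lambda actually runs.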
I now tend to consider Aparapi a bridge technology between current Java GPU
libraries and true JVM GPU capabilities down the line in Java 9 (2015) via the
OpenJDK Sumatra project. Sumatra will show (and, in test branches on
appropriate hardware, already can show) all sorts of code executing on the GPU
that we never imagined in the past (TLAB allocations, inlined methods, direct
heap access), although there are many challenges ahead (getting JVM
'safepoint' semantics and Java exception handling to work on SIMT/SIMD devices
is interesting!).
Until Sumatra is available, we will bring a lot of these 'Sumatra-like'
capabilities to Aparapi, and push the boundaries of what can be efficiently
executed on the GPU from pure Java code, without any knowledge of the GPU
architecture.
But, as I mentioned earlier, I would be very interested in hearing which
particular architectural complexities (presumably based on current OpenCL
features) you would like to see exposed, and more specifically how we should
expose them from Java.
Gary
Original comment by frost.g...@gmail.com
on 6 Oct 2013 at 4:13
Thank you for your thorough, non-dismissive reply.
I've glanced at some of the HSA work and saw mentions of it enabling the direct
use of Java objects (such as arrays of pointers to objects on the heap), and
was rather puzzled.
HPC machines were, are, and always will be vector based, with their
"vectorness" permeating their ALUs, memories, caches, registers, etc. Why is
that? Is it a conspiracy to make programming difficult? Is it a "mistake" that
can be "rectified"? No. It's just a reflection of some engineering trade-offs.
** The architecture is the point, stupid **
Collections of independent, scalar cores are great. If you don't want to be
bothered by architectural complexities and want to do things like reference
scattered Java objects with abandon, they are the answer. In fact, you can go
out and buy an Intel "Many Integrated Core" (MIC) card right now, and get a
whopping 60 cores to do such things on. (Of course, you'll have to ignore the
fact that each core has a wide vector unit, but as a trade-off let's say the
hypothetical pure-scalar MIC has 120 cores.) Having 120 cores is pretty good,
even if each runs at only 1 GHz. But GPUs pack 2048 ALU elements into the same
number of transistors.
That's a lot better! I'm sure I don't need to spell out how engineers achieve
that, but the gist is *precisely* that the "independence" of those ALUs (and
associated registers, memories, etc) is traded off for arithmetic throughput.
It is inescapable. It is the point! It is the favorable engineering trade-off.
What happens if a programmer ignores that architectural reality? Like
Cinderella's carriage, those 2048 ALUs revert to working no better than 120
cores. Or maybe they work worse than 4. So how does that programmer feel at
missing out on 95% of the potential performance? He feels upset. Was there
really even a point to trying to grok something new with his old, thick head?
To flip the argument: the AMD hardware engineers have created something great,
a 2048-ALU vector processor, not a much simpler 120-core scalar CPU. If you
believe that programmers should program it as if it were one (and do horribly
non-vector things like work with arrays of pointers to Java objects), then
that hard work is for naught. Worse, AMD would be vulnerable to a competitor
entering the market with such a simple CPU (*cough* Intel *cough*, or Samsung
with ARM, or even MediaTek) and pushing it out altogether.
Not that I believe that that will happen. Rather, developers who don't care
about performance won't do anything, while developers who do will use
frameworks that let them embrace the aforementioned engineering tradeoff.
But, I'll repeat, they'll be happy to use languages and APIs that let them
focus on thinking about how to optimize their algorithms, and not how to write
the source code.
Anyway, enough pontification. I haven't given Aparapi much of a spin, but I
got the impression that event streams are unsupported, a fact made much worse
by the large overhead of launching a kernel (and doubly so by Windows' watchdog
limits on long-running kernels). Local memory and explicit memory transfers
are implemented, if a bit hidden. I'd make things like that a lot more
prominent.
You mention GPU-friendly coding hurting JTP performance. The JTP path is
stunted at birth, in part due to Java's limitations and in part due to
insufficient optimization. I'd expect any platform I'd deploy Aparapi on to at
least have a CPU OpenCL driver. Generally, GPU-friendly coding does not make
CPU execution worse, especially in the case of local memory. For example, a
matrix multiplication kernel that uses local memory exploits memory locality,
which is just as crucial on a CPU. (The problem with JTP, as I mentioned in
another issue, is that the work units of a single local block should be
executed on a single CPU core/thread via continuations. In that case, the CPU
L1 cache would be the direct analog of local memory.)
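To illustrate the locality point, here is a self-contained sketch in plain Java (the matrix size and tile width are arbitrary choices of mine) of blocked matrix multiplication: the same tiling that would map to OpenCL local memory on a GPU acts as L1-cache blocking on a CPU.

```java
public class TiledMatMul {
    static final int N = 128;   // matrix dimension (illustrative)
    static final int TILE = 16; // tile width, sized to fit L1 / local memory

    // Accumulates C += A * B over TILE x TILE blocks, so each block of
    // A, B and C stays hot in cache while it is reused -- the CPU
    // analogue of staging tiles in OpenCL local memory.
    static void multiply(float[] a, float[] b, float[] c) {
        for (int i0 = 0; i0 < N; i0 += TILE)
            for (int k0 = 0; k0 < N; k0 += TILE)
                for (int j0 = 0; j0 < N; j0 += TILE)
                    for (int i = i0; i < i0 + TILE; i++)
                        for (int k = k0; k < k0 + TILE; k++) {
                            float aik = a[i * N + k];
                            for (int j = j0; j < j0 + TILE; j++)
                                c[i * N + j] += aik * b[k * N + j];
                        }
    }
}
```

The block structure is exactly what a GPU work-group would do with local memory; on the CPU it simply keeps each TILE x TILE working set inside L1, which is why this style of code tends to run well on both.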
Original comment by adubinsk...@gmail.com
on 6 Oct 2013 at 8:28
Original issue reported on code.google.com by
adubinsk...@gmail.com
on 6 Oct 2013 at 2:47