Thanks for the feedback. You make some very interesting points.
Whilst I don't recall having that specific goal of an
'easily-write-a-crappy-GPGPU-implementation framework' :) I do concede that
some effort needs to be made on what you call an 'Aparapi Programming Guide';
it would help immensely. Often I find that developers 'assume' how Aparapi
works and end up coding 'anti-patterns'. We should work harder on this.
I don't agree with #2 at all. We consciously traded potential performance gains
for simplicity, and for the ability to code once for the GPU or for
multi-threaded Java as a fallback. We did this (as you noted) by deliberately
avoiding exposing architectural complexities. I note that when we did attempt
to expose some features for performance (local memory comes to mind), we
marginally improved performance on the GPU (relative to using global memory
only) at the cost of a severe performance penalty when we fall back to JTP
(multi-threaded mode).
Also which particular architectural complexities would you like us to expose to
Java developers? I would be interested in hearing your suggestions.
My POV is that as technology moves ahead, some of the current Aparapi
restrictions (access to the Java heap being the big one; lack of support for
simple Java types such as Strings and boxed Integers being others) will
disappear or become less conspicuous.
Take a look at some of the examples in the proposed lambda/HSA tree. Here we
can directly use Strings on the Java heap and use the Java 8 stream programming
model. Some of this can be emulated from JCuda, JOpenCL, JOCL or OpenCL
versions of Aparapi but HSA enabled Aparapi will be able to do this directly
and efficiently.
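To make the stream-programming point concrete, here is a minimal sketch in plain Java 8 (no Aparapi or HSA required; the data and lambda are illustrative) of the kind of data-parallel work over heap-resident Strings that the lambda/HSA tree aims to offload:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class StreamSketch {
    public static void main(String[] args) {
        List<String> names = Arrays.asList("aparapi", "sumatra", "hsa");

        // A data-parallel transform over heap-resident Strings. On an
        // HSA-enabled JVM the same lambda could be dispatched to the GPU;
        // here it simply runs on the fork/join common pool.
        List<String> upper = names.parallelStream()
                                  .map(String::toUpperCase)
                                  .collect(Collectors.toList());

        System.out.println(upper); // [APARAPI, SUMATRA, HSA]
    }
}
```

Note that `collect(Collectors.toList())` preserves encounter order even for a parallel stream, which is what lets the programming model stay oblivious to where the lambda actually runs.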
I now tend to consider Aparapi a bridge technology between current Java GPU
libraries and true JVM GPU capabilities down the line in Java 9 (2015) via the
OpenJDK Sumatra project. Sumatra will show (and, in test branches on
appropriate hardware, already can show) all sorts of code executing on the GPU
that we never imagined in the past (TLAB allocations, inlined methods, direct
heap access), although there are many challenges ahead (getting JVM
'safepoint' semantics and Java exception handling to work on SIMT/SIMD devices
is interesting!).
Until Sumatra is available, we will bring a lot of these 'Sumatra-like'
capabilities to Aparapi, and push the boundaries of what can be efficiently
executed on the GPU from pure Java code, without any knowledge of the GPU
architecture.
But, as I mentioned earlier, I would be very interested in hearing which
particular architectural complexities (presumably based on current OpenCL
features) you would like to see exposed, and more specifically how we should
expose them from Java.
Gary
Original comment by frost.g...@gmail.com
on 6 Oct 2013 at 4:13
Thank you for your thorough, non-dismissive reply.
I've glanced at some of the HSA work and saw mentions of it enabling the direct
use of Java objects (such as arrays of pointers to objects on the heap), and
was rather puzzled.
HPC machines were, are, and always will be vector based, with their
"vectorness" permeating their ALUs, memories, caches, registers, etc. Why is
that? Is it a conspiracy to make programming difficult? Is it a "mistake" that
can be "rectified"? No. It's just a reflection of some engineering trade-offs.
** The architecture is the point, stupid **
Collections of independent, scalar cores are great. If you don't want to be
bothered by architectural complexities and want to do things like reference
scattered Java objects with abandon, they are the answer. In fact, you can go
out and buy an Intel "Many Integrated Core" (MIC) card right now, and get a
whopping 60 cores to do such things on. (Of course, you'll have to ignore the
fact that each core has a wide vector unit, but as a trade-off let's say the
hypothetical pure-scalar MIC has 120 cores.) Having 120 cores is pretty good,
even if each runs at only 1 GHz. But GPUs pack 2048 ALU elements into the same
number of transistors.
That's a lot better! I'm sure I don't need to spell out how engineers achieve
that, but the gist is *precisely* that the "independence" of those ALUs (and
associated registers, memories, etc) is traded off for arithmetic throughput.
It is inescapable. It is the point! It is the favorable engineering trade-off.
What happens if a programmer ignores that architectural reality? Like
Cinderella's carriage, those 2048 ALUs revert to working no better than 120
cores. Or maybe they work worse than 4. So how does that programmer feel at
missing out on 95% of the potential performance? He feels upset. Was there
really even a point to trying to grok something new with his old, thick head?
To flip the argument: the AMD hardware engineers have created something great,
a 2048-ALU vector processor, not a much simpler 120-core scalar CPU. If you
believe that programmers should program it as if it were one (and do horribly
non-vector things like work with arrays of pointers to Java objects), then
that hard work is for naught. Worse, AMD would be vulnerable to a competitor
entering the market with such a simple CPU (*cough* Intel *cough*, or Samsung
with ARM, or even MediaTek) and pushing it out altogether.
Not that I believe that that will happen. Rather, developers who don't care
about performance won't do anything, while developers who do will use
frameworks that let them embrace the aforementioned engineering tradeoff.
But, I'll repeat, they'll be happy to use languages and APIs that let them
focus on thinking about how to optimize their algorithms, and not how to write
the source code.
Anyway, enough pontification. I haven't given Aparapi much of a spin, but I
got the impression that event streams are unsupported, a fact made much worse
by the large overhead of launching a kernel (and doubly so by Windows' watchdog
limits on long-running kernels). Local memory and explicit memory transfers
are implemented, if a bit hidden. I'd make things like that a lot more
prominent.
You mention GPU-friendly coding hurting JTP performance. The JTP path is
stunted at birth, in part due to Java's limitations and in part due to
insufficient optimization. I'd expect any platform I'd deploy Aparapi on to at
least have a CPU OpenCL driver. Generally, GPU-friendly coding does not make
CPU execution worse, especially in the case of local memory. For example, a
matrix multiplication kernel that uses local memory exploits memory locality,
which is just as crucial on a CPU. (The problem with JTP, as I mentioned in
another issue, is that the work units of a single local block should be
executed on a single CPU core/thread via continuations. In that case, the CPU
L1 cache would be the direct analog of local memory.)
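To illustrate the locality point, here is a self-contained sketch in plain Java (the matrix size and tile width are arbitrary choices of mine) of blocked matrix multiplication: the same tiling that would map to OpenCL local memory on a GPU acts as L1-cache blocking on a CPU.

```java
public class TiledMatMul {
    static final int N = 128;   // matrix dimension (illustrative)
    static final int TILE = 16; // tile width, sized to fit L1 / local memory

    // Accumulates C += A * B over TILE x TILE blocks, so each block of
    // A, B and C stays hot in cache while it is reused -- the CPU
    // analogue of staging tiles in OpenCL local memory.
    static void multiply(float[] a, float[] b, float[] c) {
        for (int i0 = 0; i0 < N; i0 += TILE)
            for (int k0 = 0; k0 < N; k0 += TILE)
                for (int j0 = 0; j0 < N; j0 += TILE)
                    for (int i = i0; i < i0 + TILE; i++)
                        for (int k = k0; k < k0 + TILE; k++) {
                            float aik = a[i * N + k];
                            for (int j = j0; j < j0 + TILE; j++)
                                c[i * N + j] += aik * b[k * N + j];
                        }
    }
}
```

The block structure is exactly what a GPU work-group would do with local memory; on the CPU it simply keeps each TILE x TILE working set inside L1, which is why this style of code tends to run well on both.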
Original comment by adubinsk...@gmail.com
on 6 Oct 2013 at 8:28
Original issue reported on code.google.com by
adubinsk...@gmail.com
on 6 Oct 2013 at 2:47