GoogleCodeExporter opened 9 years ago
The whole array-of-objects solution is tricky (as you probably know) because we
basically have to marshal/serialize arrays of objects into one contiguous block
of memory, move that to the GPU, then marshal/serialize the mutated objects back.
We only copy the accessed fields from the accessed objects, which helps
minimize the amount of data we move.
The IBM Java folks are proposing an annotation called 'PackedObjects' which
would apply to array and object creation and helps denote alignment and padding.
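To make the marshalling step concrete, here is a minimal sketch of the idea: only the fields the kernel actually touches get packed into one contiguous buffer for the GPU copy, and get written back afterwards. The `Body` class, its fields, and the method names are illustrative assumptions, not Aparapi's real code.

```java
// Hypothetical example of field-level marshalling for an array of objects.
class Body {
    float x, y;   // accessed by the kernel -> marshalled
    String name;  // never accessed by the kernel -> skipped entirely

    Body(float x, float y, String name) { this.x = x; this.y = y; this.name = name; }
}

public class Marshal {
    // Pack the two accessed fields of each Body into one contiguous array.
    static float[] marshal(Body[] bodies) {
        float[] buf = new float[bodies.length * 2];
        for (int i = 0; i < bodies.length; i++) {
            buf[i * 2]     = bodies[i].x;
            buf[i * 2 + 1] = bodies[i].y;
        }
        return buf; // this contiguous block is what would be copied to the GPU
    }

    // Copy mutated values back into the original objects after the kernel runs.
    static void unmarshal(float[] buf, Body[] bodies) {
        for (int i = 0; i < bodies.length; i++) {
            bodies[i].x = buf[i * 2];
            bodies[i].y = buf[i * 2 + 1];
        }
    }
}
```

The cost Gary alludes to is visible here: every execution pays for the pack/unpack loops on the host, which is exactly what a shared-memory ("a pointer is a pointer") model avoids.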
I must confess that this all seems non-performant. I have been working on the
lambda/HSA branch and it is a pure joy to allow the GPU to just follow
pointers, just like the CPU does. So we just pass a pointer to an array of
objects and the GPU follows the pointers. There are still challenges with
virtual methods (especially when a new derived class gets loaded - normally the
JVM JIT handles this by recompiling all methods which might be affected).
Gary
Original comment by frost.g...@gmail.com
on 16 Aug 2013 at 6:55
Hm, I have now gone for a different approach. The @InlineClass annotation is
evaluated after bytecode parsing and afterwards handled explicitly in
KernelWriter. Not that good looking, but it seems to work. I'll port my
simulator and find out whether it really works :-)
In effect, this should be better than packing objects into a single memory block,
as we do not need to explicitly take care of every single object in the arrays,
but can just move the array directly to the GPU. Consequently, the native part
does not even need to know about the annotation; it is just given the
reference, no matter where the referenced object is actually placed.
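For readers following along, here is a rough sketch of what the annotation side of this could look like. The name @InlineClass comes from the comment above, but the retention policy, target, and the `Vec2` example class are my guesses, not the actual patch:

```java
import java.lang.annotation.*;

// Sketch of the @InlineClass marker described above. RUNTIME retention so a
// post-bytecode-parsing stage (e.g. KernelWriter) could discover it via
// reflection; the real patch may use a different retention/target.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface InlineClass {}

// A simple value type the kernel writer could expand field-by-field
// instead of treating array elements as opaque object references.
@InlineClass
class Vec2 {
    float x, y;
}

public class InlineCheck {
    // Returns true if the class is marked for inlining.
    static boolean isInline(Class<?> c) {
        return c.isAnnotationPresent(InlineClass.class);
    }
}
```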
HSAIL seems pretty nice, I must confess. How far along are you with that? I
always thought that you need to make the HotSpot compiler create proper HSAIL
code. Does this code also have to be parsed by Aparapi afterwards? Or can you
execute the code directly on the GPU? There must be some wrapper to tell which
GPU actually has to be used ...
Matthias
Original comment by matthias.klass@gmail.com
on 20 Aug 2013 at 9:58
HSAIL is kind of like bytecode for GPU devices (or, more correctly, data-parallel-style
accelerator devices). Sumatra (the OpenJDK project which I and a few
other Aparapi committers/contributors are working on) will indeed allow the JVM
JIT compiler (HotSpot or Graal) to create HSAIL as a target, in the same way
that the current JITs can create x86/SPARC/ARM ISA code.
I am working in the branches/lambda tree to convert bytecode to HSAIL in much
the same way that we convert from bytecode to OpenCL at present. The HSAIL
generation is coming along quite nicely, and on real hardware (AMD folk, not
surprisingly, have early access to HSA-enabled hardware) we get some good
performance. Most of this performance comes from not having to move data
between the host memory and the GPU. AMD calls this hUMA (heterogeneous
Uniform Memory Access), but I still refer to it as 'a pointer is a pointer'.
This allows us to access Java heap objects directly on the GPU.
Gary
Original comment by frost.g...@gmail.com
on 20 Aug 2013 at 2:30
Just a couple of quick questions; apologies if I should know these answers:
1) Aren't future OpenCL releases planning to use HSA under-the-covers?
1a) If yes, why target HSAIL directly?
2) Will HSA still work as described in #3 if you do not have an APU/HSA-enabled
device?
2a) If no, is Aparapi planning to have multiple possible execution paths?
2b) If yes, will this pave the way for CUDA as well?
Original comment by pnnl.edg...@gmail.com
on 20 Aug 2013 at 4:14
Answers to questions.
1 + 1a)
Whilst OpenCL may well be implemented on top of HSA, the use of SVM (Shared Virtual Memory) is not planned until OpenCL 2.0 (https://www.khronos.org/news/press/khronos-releases-opencl-2.0), so we would still be required to move/copy blocks of memory to the GPU in Aparapi until SVM is generally available.
2 + 2a )
My understanding is that HSA features can only be expected from HSA-compatible devices. So for non-HSA devices we would need to use OpenCL.
My thoughts are that we would extend the Aparapi execution framework to first look for HSA devices. If an HSA device exists and we can convert the kernel to HSAIL, we target HSAIL. If not (but we have OpenCL), we try OpenCL; if not, we dispatch in a thread pool. The good news is that the HSAIL restrictions are far fewer than the OpenCL restrictions ;) So I would say that if we can't code it in HSAIL, we can't code it in OpenCL.
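The dispatch order described above can be sketched as a simple fallback chain. The enum and the boolean device-probe parameters here are stand-ins for illustration, not real Aparapi API:

```java
// Sketch of the proposed dispatch order: prefer HSA, fall back to OpenCL,
// and finally to a Java thread pool, which is always available.
enum ExecutionMode { HSAIL, OPENCL, JAVA_THREAD_POOL }

public class Dispatch {
    // Each boolean stands in for a device probe / bytecode-conversion check.
    static ExecutionMode select(boolean hsaDevicePresent, boolean kernelConvertsToHsail,
                                boolean openclDevicePresent, boolean kernelConvertsToOpencl) {
        if (hsaDevicePresent && kernelConvertsToHsail) return ExecutionMode.HSAIL;
        if (openclDevicePresent && kernelConvertsToOpencl) return ExecutionMode.OPENCL;
        return ExecutionMode.JAVA_THREAD_POOL; // always-available fallback
    }
}
```

Note that Gary's last remark (fewer HSAIL restrictions than OpenCL restrictions) implies `kernelConvertsToOpencl` can never be true when `kernelConvertsToHsail` is false, so the OpenCL branch only matters when the HSA *device* is missing.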
I chose to punt on 2b) :)
Gary
Original comment by frost.g...@gmail.com
on 20 Aug 2013 at 4:30
Original issue reported on code.google.com by
matthias.klass@gmail.com
on 16 Aug 2013 at 1:41