Automatically exported from code.google.com/p/aparapi

Multiple Entrypoints #124

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
Hi,

I have some ideas about multiple entrypoints that I am thinking about 
implementing for Aparapi. To be honest, I am not really a fan of the 
multiple entry points proposal within the Wiki pages, as you are still 
required to put all your kernel methods into one single class. What I'd like to 
have is the possibility to structure the entrypoints into multiple classes, in 
order to have some means of structuring the whole program.

What I came up with is the proposal within the appendix. Before implementing, 
I'd like to hear some feedback from the experts :-).

The first and biggest change would, in my opinion, be to turn the current 
KernelRunner into a global Context class. Its responsibility would be to hold 
the JNI context and manage all kernel calls, including compilation and 
generation of OpenCL code. In addition, it would hold all references to copied 
GPU objects, as it currently does.

In contrast to the current implementation, the Kernel class would not derive 
from KernelRunner. Instead, an instance of Kernel would be passed on each 
kernel call. The KernelRunner could hold a map of each compiled Kernel class 
and call the right implementation.

The current implementation of KernelStates and so on can remain the same. 
Even passing references across kernels should not break the current behaviour, 
as Aparapi checks whether it needs to copy references to the kernel, not 
distinct fields of Kernels.

However, the respective put and get methods would have to move to the Context 
so that they can still access the JNI context.
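
To make this concrete, here is a minimal sketch of what such a Context could 
look like; all names and signatures below are hypothetical, just to illustrate 
the shape of the proposal:

import java.util.HashMap;
import java.util.Map;
import com.amd.aparapi.Kernel;

// Hypothetical sketch of the proposed global Context (not actual Aparapi API).
public class Context {
    // one shared JNI context handle for all kernels
    private long jniContextHandle;

    // maps each compiled Kernel class to its generated OpenCL program handle
    private final Map<Class<?>, Long> compiledKernels = new HashMap<Class<?>, Long>();

    // compiles a kernel on first use, then dispatches it on the device
    public void execute(Kernel kernel, int range) {
        Long handle = compiledKernels.get(kernel.getClass());
        if (handle == null) {
            handle = compile(kernel);   // bytecode -> OpenCL, as today
            compiledKernels.put(kernel.getClass(), handle);
        }
        run(handle, kernel, range);
    }

    // explicit transfers move here so they can reach the JNI context
    public Context put(Object array) { /* copy host -> device */ return this; }
    public Context get(Object array) { /* copy device -> host */ return this; }

    private long compile(Kernel kernel) { return 0L; }          // placeholder
    private void run(long handle, Kernel kernel, int range) { } // placeholder
}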

A smaller additional change would be to move all the OpenCL math methods 
(floor, sqrt, ...) into a class of their own (to get some more structure).

So what could a kernel look like afterwards? See Main.java in the appendix of 
this issue. Using two Kernel objects, there would be two distinct entry points. 
To me, this looks like a pretty clean API. 
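
In case the attachment does not survive the export, a rough, hypothetical 
sketch of the style I have in mind (the kernel bodies are invented for 
illustration, and Context is the sketch from above):

import com.amd.aparapi.Kernel;

public class Main {
    public static void main(String[] args) {
        final float[] data = new float[1024];

        // first entry point: square every element
        Kernel square = new Kernel() {
            @Override public void run() {
                int i = getGlobalId();
                data[i] = data[i] * data[i];
            }
        };

        // second entry point: scale every element, as its own object
        Kernel scale = new Kernel() {
            @Override public void run() {
                int i = getGlobalId();
                data[i] = data[i] * 0.5f;
            }
        };

        Context context = new Context();
        context.put(data);
        context.execute(square, data.length);
        context.execute(scale, data.length);   // same context, second entry point
        context.get(data);
    }
}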

Thinking some more about the future, this could be used to provide some default 
math kernels. For example, there could be a MatrixMultiplicationKernel that 
uses an underlying optimized BLAS algorithm to run efficiently on the GPU 
(this could really speed up execution for Aparapi). 
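
Such a library kernel might look roughly like this (a naive sketch; a real 
BLAS-style version would add tiling and local memory):

import com.amd.aparapi.Kernel;

// Hypothetical prepackaged kernel: C = A * B for n x n row-major matrices,
// executed over a 2D range (e.g. Range.create2D(n, n)).
public class MatrixMultiplicationKernel extends Kernel {
    private final float[] a, b, c;
    private final int n;

    public MatrixMultiplicationKernel(float[] a, float[] b, float[] c, int n) {
        this.a = a; this.b = b; this.c = c; this.n = n;
    }

    @Override public void run() {
        int row = getGlobalId(0);
        int col = getGlobalId(1);
        float sum = 0f;
        for (int k = 0; k < n; k++) {
            sum += a[row * n + k] * b[k * n + col];
        }
        c[row * n + col] = sum;
    }
}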

What do you think about this proposal? Are there any no-gos here? I have some 
time left within my master's thesis, so I would try to create a prototype, 
which would (finally) result in the feature of multiple entry points in Aparapi :-).

Matthias

Original issue reported on code.google.com by matthias.klass@gmail.com on 11 Jul 2013 at 9:36

Attachments:

GoogleCodeExporter commented 8 years ago
Oh sorry, Kernel currently does not inherit from KernelRunner but only 
delegates the call. Sorry for the inconsistency ...

Original comment by matthias.klass@gmail.com on 11 Jul 2013 at 9:49

GoogleCodeExporter commented 8 years ago
Ok, here is a proof of concept. Multiple entrypoints are not working yet, but I 
changed the API to support passing kernel instances. Mandelbrot compiles and 
works using GPU execution.

Link: https://github.com/klassm/aparapi-clone
(just temporary on Github ...)

Matthias

Original comment by matthias.klass@gmail.com on 11 Jul 2013 at 12:58

GoogleCodeExporter commented 8 years ago
A simple question: when executing a new entry point, it has to be prepared 
for execution by generating code and by setting the proper JNI args. Is it ok 
to reinitialize the JNI args with the arguments of all kernel entrypoints?

Kernel A: fieldA, fieldB
Kernel B: fieldC, fieldD

The total of all four fields would be used afterwards. This has the disadvantage 
that on every update call all fields have to be updated, which might result in a 
slowdown. However, a workaround would be to store the kernel args per entrypoint 
in a map and only call update on the required arguments (see the sketch below).
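
The map-based workaround could look roughly like this (KernelArg stands in for 
the existing JNI-side argument descriptor; update() is a hypothetical method):

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import com.amd.aparapi.Kernel;

// Sketch: keep the args of each entry point separate, so an update only
// touches the fields the dispatched kernel actually uses.
public class EntrypointArgs {
    // stand-in for the real JNI-side argument descriptor
    interface KernelArg { void update(); }

    private final Map<Class<?>, List<KernelArg>> argsPerEntrypoint =
            new HashMap<Class<?>, List<KernelArg>>();

    void updateArgsFor(Kernel kernel) {
        for (KernelArg arg : argsPerEntrypoint.get(kernel.getClass())) {
            arg.update();   // push only this entry point's fields via JNI
        }
    }
}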

What do you think?

Original comment by matthias.klass@gmail.com on 11 Jul 2013 at 2:03

GoogleCodeExporter commented 8 years ago
Matthias, thanks for diving in here. 

When you look through the code you will see some coding attempts at multiple 
entrypoints. Actually, you may see multiple attempts to solve this problem in 
the code as you experiment. 

Each time I got stuck on different aspects. 

Initially it was how to dispatch. With a single abstract method type 
(Kernel.run()) as the only entrypoint, it is easy to map Kernel.execute() to 
dispatch the Kernel.run() method. When there are multiple possible entrypoints, 
I think we need to rely on String name mapping and/or reflection. The 
new method handles (Java 7) do offer a cleaner mapping.

BTW I think the bound interface approach that we used for accessing 
pre-constructed OpenCL may offer a possible solution here. So instead of just 
extending Kernel, a user creates an interface which exposes the 'entrypoints' 
and implements the interface as part of their Kernel definition. Then we can 
use Java's 'proxy' mechanism to construct a real object which delegates to the 
KernelRunner (I think I need to draw a diagram for this ;) ). Java 8's lambdas 
solve this using method handles and synthetically generated inner classes on 
the fly. 
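
A rough sketch of the proxy idea (the interface, the execute-by-name runner 
method, and all names below are invented; this is not Aparapi's actual API):

import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;

public class EntrypointProxy {
    // stand-in for the real KernelRunner; dispatch-by-name is an assumption
    interface Runner {
        Object execute(String entrypointName, Object[] args);
    }

    // The user declares entry points as an interface, e.g.:
    //   interface NBody { void move(float dt); void collide(); }
    // and gets back a proxy whose calls are routed to the runner by method name.
    public static <T> T bind(Class<T> iface, final Runner runner) {
        return iface.cast(Proxy.newProxyInstance(
                iface.getClassLoader(),
                new Class<?>[] { iface },
                new InvocationHandler() {
                    public Object invoke(Object proxy, Method method, Object[] args) {
                        return runner.execute(method.getName(), args);
                    }
                }));
    }
}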

The other issue I encountered was dealing with arrays/buffers which are 
accessed inconsistently (RW vs RO vs WO) depending on the entrypoint. Because 
we have no idea what order of dispatch might take place, we may need to fall 
back to the 'minimal' restriction. So if entrypoint E1 accesses A as RW and 
entrypoint E2 accesses A as RO, we define the buffer as RW and always pass it 
back and forth between calls. 

The latter can be simplified a little by forcing explicit buffer transfers when 
using multiple entrypoints.  
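
The 'minimal restriction' rule is easy to state in code (a sketch):

// Sketch: merge the access modes a buffer has across all entry points;
// any disagreement widens to RW, which is always safe (the minimal restriction).
public class AccessModes {
    enum Access { RO, WO, RW }

    static Access merge(Access a, Access b) {
        return (a == b) ? a : Access.RW;
    }
}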

I do think we need one 'context' (OpenCL command queue) shared between all 
possible entrypoints. The KernelRunner can act in this role, I think. I am not 
sure another level of abstraction is needed.    

I am very interested in this work; like I said, I have approached this multiple 
times already and got overwhelmed ;) I really do welcome someone with a fresh 
pair of eyes and a different perspective taking a crack at this. 

If you would like to bounce ideas around, I would be more than happy to do 
this. 

Original comment by frost.g...@gmail.com on 11 Jul 2013 at 2:21

GoogleCodeExporter commented 8 years ago
Hi,

sounds good :-). The current state is that I have really split up Kernels from 
the KernelRunner, which now, as you described, acts as a single holder for the 
JNI context. I can even start multiple kernel objects, as long as they contain 
the same arguments. What is missing is the management of different kernel 
fields. I'll find a solution for this ;-)

Matthias

Original comment by matthias.klass@gmail.com on 11 Jul 2013 at 2:28

GoogleCodeExporter commented 8 years ago
Hi Matthias,

This is an excellent issue request...and is a dupe of 
http://code.google.com/p/aparapi/issues/detail?id=21.

But no worries, what you are describing in this issue speaks to some 
discussions we've had in the past. See the following:

Specifically, Comment #3:
http://code.google.com/p/aparapi/issues/detail?id=105#c3

http://code.google.com/p/aparapi/issues/detail?id=104

I think that decoupling a number of the classes and changing the way OpenCL is 
executed will work towards a number of goals. One thing that would be very nice 
to see, for example, would be a Kernel accepting another Kernel as an argument, 
allowing us to chain calls.

I'll keep track of this discussion. In the next few weeks, I will have some 
more time available and plan to take another look at a couple of things in 
Aparapi. I also plan to submit a fairly rigorous Correlation Matrix test case 
that I need some eyeballs to look at for performance modifications. Or 
potentially use it as a test case for issue tickets like this one :)

Original comment by pnnl.edg...@gmail.com on 11 Jul 2013 at 10:19

GoogleCodeExporter commented 8 years ago
I'll try to keep you up to date, which is why I am posting my progress today:

* I'll commit any changes to https://github.com/klassm/aparapi-clone, where you 
will have a chance to watch my progress. If you want, we can later merge it 
back to a svn branch.

* The KernelRunner now holds a map of multiple JNIContext values. Whenever a 
kernel run is scheduled, the right one is pulled from the map and executed. 
(BTW: I also cleaned up the KernelRunner class a little bit, splitting the 
execute method into some more readable methods.)

* Internally, on the JNI side, I plan to have the JNIContext objects map to the 
same OpenCL context. This is why the OpenCL init will move to some global 
object which all JNIContexts can access. This is still missing for now.

* Another thing which is missing, and which I am currently pondering, is how to 
make sure that KernelArgs can refer to the same Java object references. I 
thought about mapping KernelArgs to GPUArgs (which represent Java objects 
currently on the GPU). That way, KernelArgs from multiple entrypoints could 
refer to the same GPU memory locations. This is a bit tricky and, as of now, I 
am not really sure how I want to implement it.

When both issues above are done, I think it should be possible to execute 
multiple kernels (or am I missing something?). Let's see!

Original comment by matthias.klass@gmail.com on 15 Jul 2013 at 2:42

GoogleCodeExporter commented 8 years ago
Very nice, I look forward to tracking your progress.

Original comment by pnnl.edg...@gmail.com on 16 Jul 2013 at 1:31

GoogleCodeExporter commented 8 years ago
Ok, finally it works. There might still be some bugs, but essentially it is 
possible to execute multiple kernel entry points on the same JNI context. 
Example: 
https://github.com/klassm/aparapi-clone/blob/master/test/runtime/src/java/com/amd/aparapi/test/runtime/MultipleKernelCall.java

This was a bigger change, as I wanted to map the buffers of the KernelArgs 
to each other. That way, multiple kernel args can point to the same GPU memory, 
which is pretty neat I guess. For that, I implemented a BufferManager, which is 
responsible for managing all the buffers referring to Java objects as well as 
for cleaning up afterwards, so as not to leave any memory allocated on the GPU 
after execution.
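
The sharing idea, sketched in Java for readability (the actual BufferManager 
lives on the JNI/C++ side, and the names here are invented):

import java.util.IdentityHashMap;
import java.util.Map;

// Sketch: deduplicate device buffers by the identity of the Java object they
// mirror, so KernelArgs from different entry points share one GPU allocation.
public class BufferManager {
    private final Map<Object, Long> deviceBuffers = new IdentityHashMap<Object, Long>();

    long bufferFor(Object javaArray) {
        Long buffer = deviceBuffers.get(javaArray);
        if (buffer == null) {
            buffer = allocateOnDevice(javaArray);   // clCreateBuffer on the native side
            deviceBuffers.put(javaArray, buffer);
        }
        return buffer;
    }

    private long allocateOnDevice(Object javaArray) { return 0L; }   // placeholder
}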

In addition, the OpenCL context moved to a global attribute to make it 
accessible from multiple JNIContexts. This has the consequence that the 
execution device can only be set once (for the first kernel call). I'll have to 
adjust the API accordingly.

The main test cases work with my implementation. However, some of them do not 
work. I went to check whether this is because of me or whether this also happens 
in trunk. Result: the tests also fail there. The test cases concerned are:
- Game of Life (only a black screen?)
- Issue 102 (OOP conversion)
- Issue 103
- UseStaticArray

So I'll go change the API :-)

Original comment by matthias.klass@gmail.com on 17 Jul 2013 at 10:01

GoogleCodeExporter commented 8 years ago
I just tested how my implementation behaves in terms of speed. Using my fluid 
simulation, I cannot detect any differences. The next step for me is to port 
my fluid solver implementation to the new multiple entry points 
implementation. As this is a pretty big use case, I hope to uncover any hidden 
errors. I'll report back on how it goes.

Original comment by matthias.klass@gmail.com on 17 Jul 2013 at 2:54

GoogleCodeExporter commented 8 years ago
Regarding Game of Life not working: 

It works for me. If you have a smaller screen size, the [start] button may be 
hidden (off the bottom of the screen), so it may appear to just be a blank 
screen. 

If you are using Linux, clicking on the frame should make the start button 
visible. 

Can you check this? 

BTW your exposing of the KernelRunner is interesting. Before Aparapi was called 
Aparapi it was a much smaller project called 'Barista' and it allowed multiple 
kernels to be executed by a single KernelRunner (which indeed held the context, 
queue and managed buffers). 

So in Barista we would do something like:

KernelRunner kr = new KernelRunner();
Kernel k1 = new Kernel(){
   @Override public void run(){ /* first kernel's body */ }
};
Kernel k2 = new Kernel(){
   @Override public void run(){ /* second kernel's body */ }
};
kr.run(k1);
kr.run(k2); 

;) 

Early comments indicated that for simple examples exposing the KernelRunner was 
too verbose.  

Now that we have evolved a little, I think we should have kept the KernelRunner 
as a standalone class. It would also help with explicit buffer management.

Thanks for putting this together.  I will download your repository soon and 
give this a whirl.  

This is very interesting work. 

Gary

Original comment by frost.g...@gmail.com on 17 Jul 2013 at 3:06

GoogleCodeExporter commented 8 years ago
Hi Gary,

you were right. The start button was indeed hidden - I should have seen that 
...

Since you started this - what was the origin of Aparapi? I always thought 
that the framework has its roots within AMD. Concerning Barista, I did not find 
anything on Google - an internal framework? This would just be interesting for 
the final presentation of my master's thesis - some background information on 
the framework I chose :-). 

I also thought about making the KernelRunner a singleton and letting the user 
continue to call execute directly on the Kernel. However, if the user calls 
dispose on the KernelRunner, the whole instance will be disposed. We would then 
need a much more elaborate lifecycle. This would be a nice thing, I guess - 
there's always something to do :-)

Matthias

Original comment by matthias.klass@gmail.com on 17 Jul 2013 at 3:14

GoogleCodeExporter commented 8 years ago
Barista was the internal name at AMD.  Two weeks before our first public 
'reveal' (JavaOne 2010) there was an internal AMD request to change the name (I 
think there was an established open source project with this name).

The name Aparapi was the result of this last minute scramble :)

Barista/Aparapi was started after I was asked to write a Java-based app which 
used OpenCL for 'SuperComputing' 2009. Whilst I was happy to learn OpenCL to 
write the required OpenCL code (I think I used JOCL as the binding - and it 
worked well!), I came to the conclusion that most Java developers would prefer 
not to do this. 

I was a big fan of the Java tools JAD and Mocha (which both create perfectly 
serviceable Java source from bytecode), so I decided to see how hard it might be 
to parse bytecode and turn it into OpenCL. The basic (very crude - enough to 
run the NBody example) bytecode-to-OpenCL engine took around 3 weeks over 
Christmas of 2009.  

The hardest part (and the part we are still struggling with) is how much to 
expose to the Java developer....

Gary       

Original comment by frost.g...@gmail.com on 17 Jul 2013 at 5:50

GoogleCodeExporter commented 8 years ago
Thanks for the background info! So is the development still supported by AMD, 
or has it shifted in the meantime towards open source / leisure time :-)?

To be honest, I also like this native kind of GPU binding. For my evaluation, I 
also looked at JCuda, which is the equivalent of JOCL for CUDA. It is by far 
the fastest framework I could find for programming GPUs from Java. In some 
cases, it is 40 to 100 times faster than other frameworks. And it is not even 
that bad to program...

Matthias

Original comment by matthias.klass@gmail.com on 18 Jul 2013 at 9:02

GoogleCodeExporter commented 8 years ago
OK, finally the execution works completely. You might want to have a look at 
the implementation. As a final test, I ported the fluid simulation to the new 
API and backend. It uses explicit buffer handling and 14 kernels.

To give you an impression of how much these multiple entrypoints change the way 
the framework is used, I created two class diagrams:
- solver using the old API: http://www.hs-augsburg.de/~klassm/simulator_structure.png
- solver using the new API: http://www.hs-augsburg.de/~klassm/simulator_structure_new.png 

The new image does not contain all the information, as it would have been too 
cluttered. Instead, I added only the kernels themselves; the surrounding 
structure stayed the same.
By using the multiple entry points, I could finally split the one monolithic 
kernel into multiple Java objects. The individual kernel arguments are mapped 
by a BufferHandler class in C++ to ArrayBuffers and OpenCL memory. 

... and finally a small video of what the simulator looks like: 
http://www.hs-augsburg.de/~klassm/simulator.mkv

Matthias

Original comment by matthias.klass@gmail.com on 23 Jul 2013 at 1:55

GoogleCodeExporter commented 8 years ago
Matthias, 

Nice work, and thanks for the video (I have been sending links to folk on our 
team ;))

I plan to take a deeper look at this, when I get some time.  

Gary

Original comment by frost.g...@gmail.com on 23 Jul 2013 at 2:29

GoogleCodeExporter commented 8 years ago
Jip, sure. By the way - as for execution time: Aparapi is about 5 times slower 
than a native implementation (measured with my fluid solver example). This 
should be pretty representative, as loads of kernels are executed. Aparapi also 
takes about 1.5 times the execution time of JCuda (which is only a JNI wrapper). 
This info is taken from various benchmarks, incl. Mandelbrot and matrix 
multiplication. Just as info ...

Original comment by matthias.klass@gmail.com on 23 Jul 2013 at 2:48

GoogleCodeExporter commented 8 years ago
Thanks for the #'s.

Do you also have a sense of the performance relative to a pure Java solution?

Gary

Original comment by frost.g...@gmail.com on 23 Jul 2013 at 3:21

GoogleCodeExporter commented 8 years ago
Hi,

sure, but currently only for matrix multiplication. The other tasks 
(Mandelbrot, Conjugate Gradient, Jacobi iterations) are still running and 
taking forever ...

Aparapi is currently about 20x faster than a serial implementation.

Matthias

Original comment by matthias.klass@gmail.com on 23 Jul 2013 at 3:24

GoogleCodeExporter commented 8 years ago
Hi,

multiple entry points have now resulted in a more or less big refactoring. 
Before the restructuring, I used a global command queue. That does not really 
work, as multiple KernelRunners would interfere with each other.

To sidestep this behaviour, I changed some things:
* JNIContext => KernelContext
  (represents the context for a single kernel being executed natively)
* KernelRunnerContext (new)
  (represents the context for a single KernelRunner; now includes a command queue and all the OpenCL-dependent attributes)

In order to make this work, most of the JNI methods got an additional parameter 
referencing the kernelRunnerContext address (the same hook as previously used 
for the JNIContext).

Since I was already refactoring, I additionally removed the aparapiBuffer and 
arrayBuffer attributes from KernelArgs and introduced a single buffer object of 
type GPUElement*.
ArrayBuffer and AparapiBuffer now derive from GPUElement. Using this 
polymorphism, I could delete a whole bunch of code.

Finally, I changed the behaviour of the newly implemented BufferManager. Its 
responsibility is to look after all instantiated buffers and make sure that all 
buffers without any remaining reference are freed. Previously, I looped over all 
JNIContexts (now KernelContexts), then over all KernelArgs, and tried to 
figure out which elements in my global buffer queue were no longer referenced. 
Now I just keep an integer within each buffer indicating how many times it is 
referenced. If the reference count is 0, I can free it. This skips 4 
expensive loops and speeds up execution.
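
In outline, the counting scheme looks like this (the real code is on the C++ 
side; this Java-flavored sketch just shows the idea):

// Sketch: each device buffer counts the KernelArgs pointing at it and
// frees itself when the last reference goes away.
class DeviceBuffer {
    private int refCount = 0;

    void retain() {
        refCount++;
    }

    void release() {
        if (--refCount == 0) {
            free();   // clReleaseMemObject on the native side
        }
    }

    private void free() { /* release the underlying OpenCL memory */ }
}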

Just to keep you up to date ...
Matthias

P.S.: I'll create a class diagram to show this more clearly.

Original comment by matthias.klass@gmail.com on 1 Aug 2013 at 2:58

GoogleCodeExporter commented 8 years ago
"Finally, I changed the behaviour of the newly implemented BufferManager. Its 
responsibility is to look after all instantiated buffers and make sure that all 
buffers without any reference are freed. Previously, I looped over all 
JNIContexts (alias KernelContexts), afterwards over all KernelArgs and tried to 
figure out which elements in my global buffer queue were not referenced. Now I 
just keep an integer within the buffer indicating how many times they are 
referenced. If the reference count is 0, I can free them. This skips 4 
expensive loops and speeds up execution."

Just a thought - I haven't had a chance to look at this code, but based on what 
you wrote above, is this a candidate for a WeakHashMap instead of manual 
reference counting?
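
On the Java side, that idea would look something like this (only a sketch; as 
noted in the reply below, the buffers in question actually live on the C++ side):

import java.util.Map;
import java.util.WeakHashMap;

// Sketch: the map holds its keys weakly, so once a Java array becomes
// unreachable its entry can be reaped by the GC. Actually freeing the
// OpenCL allocation would still need a ReferenceQueue hook.
public class WeakBufferTracker {
    private final Map<Object, Long> deviceBuffers = new WeakHashMap<Object, Long>();

    void track(Object javaArray, long deviceHandle) {
        deviceBuffers.put(javaArray, deviceHandle);
    }
}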

Original comment by pnnl.edg...@gmail.com on 1 Aug 2013 at 7:19

GoogleCodeExporter commented 8 years ago
Ok finally the two class diagrams, one for the native and one for the java side.

@pnnl: Yes, something of that kind would be really nice. However, WeakHashMap 
is Java - is there an equivalent for C++? I do not really want to add a 
library dependency just to replace one reference counter.

Matthias

Original comment by matthias.klass@gmail.com on 2 Aug 2013 at 1:57

Attachments:

GoogleCodeExporter commented 8 years ago
Another two graphics which might be interesting for you:

- The first one is a comparison of a native fluid simulator against one 
implemented in Aparapi. The x-axis represents the cell count of the simulation, 
the y-axis the execution time in ms. The green curve is default Aparapi, the 
blue one my adapted version with multiple entry points and the Device#best fix 
(which is why the blue one is better; otherwise the times would have been the 
same).

- The second one is a comparison with other frameworks. Maybe you know your 
competitors? The graph contains Delite (Stanford University), Rootbeer 
(Syracuse University), JCuda and native CUDA. The execution time of Rootbeer is 
really high, which is due to multiple unnecessary kernel invocations. Delite has 
a pretty huge overhead for CUDA execution. JCuda's execution time is nearly 
exactly that of native CUDA (it is just the JNI overhead). Aparapi is 
usually a little slower than native CUDA. In some exceptional cases Aparapi is 
even a little faster (for example in Mandelbrot). This might be due to 
concurrent copying...

Matthias

Original comment by matthias.klass@gmail.com on 7 Aug 2013 at 3:18

Attachments:

GoogleCodeExporter commented 8 years ago
Thank you for posting this. 

Clearly Marco Hutter deserves some kudos for his JCuda work (and for JOCL which 
is very well structured). I need to send this to Marco... he has always been 
very supportive of Aparapi goals. 

+ thanks for motivating me to look even closer at the Device.best() fix ;) 

Gary

Original comment by frost.g...@gmail.com on 7 Aug 2013 at 6:51

GoogleCodeExporter commented 8 years ago
@matthias I've been really hoping for an update like this. Looking at how .cl 
code is written, this is pretty much how it is done. It has the benefit of 
treating buffers as RO, RW, etc. depending on the function, as well as making 
it so much easier to program.

I've built your branch and am going to give it a go now, but if this could be 
incorporated into the main project that would be fantastic. 

A lot of the classes I use that require multiple entry points get extended, and 
using a mode buffer and if/else gets messy with inheritance (see the sketch below).
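
For context, the workaround being described looks roughly like this (a sketch 
with invented pass names):

import com.amd.aparapi.Kernel;

// Sketch of the classic single-entrypoint workaround: one run() method that
// branches on a mode field - which gets messy once subclasses add modes.
public class MultiPassKernel extends Kernel {
    public static final int FORWARD = 0;
    public static final int BACKWARD = 1;

    private int mode = FORWARD;

    public void setMode(int mode) {
        this.mode = mode;
    }

    @Override public void run() {
        if (mode == FORWARD) {
            forwardPass();
        } else {
            backwardPass();
        }
    }

    private void forwardPass() { /* first "entry point" body */ }
    private void backwardPass() { /* second "entry point" body */ }
}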

Original comment by technoth...@gmail.com on 28 Sep 2013 at 4:46

GoogleCodeExporter commented 8 years ago
Also, just a thought: could this also solve some problems with 2D arrays (at 
least when processing a single sub-array at a time)?

For example, for a simple feed-forward neural network.

Original comment by technoth...@gmail.com on 28 Sep 2013 at 5:07

Attachments:

GoogleCodeExporter commented 8 years ago
Hi,
nice that it works for you! Jip, I would also appreciate an integration into 
main. However, I do not think that I am the best person to do that work - it 
should be done by the AMD Aparapi developers like Gary, so that they are able 
to support the code base later on :-).
Matthias

Original comment by matthias.klass@gmail.com on 8 Oct 2013 at 1:13

GoogleCodeExporter commented 8 years ago
Any update on getting Gary's work into mainline AMD Aparapi? I would also like 
to use multiple kernels on shared data, and this proposal seems a clean and 
simple way of doing it.

Original comment by paulsou...@gmail.com on 19 Oct 2014 at 5:32