bhack opened this issue 7 years ago:
I want to start this topic just to discuss how GPU support could be introduced in the library.
/cc @randl @edgarriba
See also section 3.4 in https://github.com/kokkos/array_ref/blob/master/proposals/P0331.rst
Would be interesting, indeed
Note that we now have strong simd support for broadcasting and a number of other use cases, based on the xsimd project.
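For context, xsimd provides architecture-agnostic SIMD batch types that xtensor uses under the hood. A minimal sketch of the kind of primitive involved, assuming xsimd's `batch` API (v8+):

```cpp
#include <xsimd/xsimd.hpp>
#include <cstddef>

// Add two float buffers in SIMD-width chunks; the batch width is
// chosen at compile time for the target architecture.
void add(const float* a, const float* b, float* out, std::size_t n) {
    using batch = xsimd::batch<float>;
    std::size_t i = 0;
    for (; i + batch::size <= n; i += batch::size) {
        auto va = batch::load_unaligned(a + i);
        auto vb = batch::load_unaligned(b + i);
        (va + vb).store_unaligned(out + i);
    }
    for (; i < n; ++i) out[i] = a[i] + b[i];  // scalar tail
}
```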
@SylvainCorlay SIMD is good and all, but the performance is not comparable to GPU acceleration. Especially for deep learning applications, it makes a lot of sense to use this kind of acceleration. Since this library uses NumPy as its API model, I recommend looking at PyTorch, which has similar goals but uses GPU acceleration.
GPU is in scope and on the roadmap. I meant that the work done for SIMD actually paved the way, since a lot of the required logic in xtensor is the same.
Note that frameworks like pytorch don't implement compile-time loop unfolding like xtensor does which can make xtensor faster in complex expressions.
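To illustrate, in xtensor an expression stays unevaluated until it is assigned, at which point the whole thing is computed in a single fused pass with no intermediate temporaries. A minimal example, assuming the classic (pre-0.26) header layout:

```cpp
#include <xtensor/xarray.hpp>
#include <xtensor/xmath.hpp>

xt::xarray<double> a = {1.0, 2.0, 3.0};
xt::xarray<double> b = {4.0, 5.0, 6.0};

// 'expr' is an unevaluated expression template: no work is done yet
auto expr = a + 2.0 * xt::sin(b);

// assignment evaluates the whole expression in one loop,
// without allocating temporaries for the sub-expressions
xt::xarray<double> res = expr;
```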
Awesome, thanks for letting me know. How are you planning to implement it? Are contributions possible?
I recommend using MAGMA as the backend to support both GPU and CPU.
Just wondering if there are any updates on the topic. I'd love to make contributions where possible!
We have GPU support on our roadmap for 2019. However, we're not yet sure how to do it concretely! So any input is highly appreciated. And of course, contributions are very welcome!
The thing we'd probably like to start out with is mapping a container to the GPU, and evaluating a simple binary function, such as A + B.
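Not the library's actual plan, but a minimal sketch of what "map a container to the GPU and evaluate A + B" could look like with standard SYCL 2020 buffers and a kernel:

```cpp
#include <sycl/sycl.hpp>
#include <vector>

int main() {
    std::vector<float> a(1024, 1.0f), b(1024, 2.0f), out(1024);

    sycl::queue q;  // default selector: typically a GPU if available
    {
        // map the host containers to device-accessible buffers
        sycl::buffer<float> buf_a(a.data(), sycl::range<1>(a.size()));
        sycl::buffer<float> buf_b(b.data(), sycl::range<1>(b.size()));
        sycl::buffer<float> buf_o(out.data(), sycl::range<1>(out.size()));

        q.submit([&](sycl::handler& h) {
            sycl::accessor xa(buf_a, h, sycl::read_only);
            sycl::accessor xb(buf_b, h, sycl::read_only);
            sycl::accessor xo(buf_o, h, sycl::write_only);
            // the "simple binary function": out = a + b
            h.parallel_for(sycl::range<1>(a.size()),
                           [=](sycl::id<1> i) { xo[i] = xa[i] + xb[i]; });
        });
    }  // buffer destructors synchronize the results back into 'out'
}
```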
Thanks for the prompt reply!
I haven't been able to take a deep dive into the code yet, but I was thinking that the implementation strategy of a recently released library called ChainerX might be of help. It basically provides device-agnostic, NumPy-like multi-dimensional arrays for C++.
AFAIK they provide a `Device` abstract class that handles memory management and hides the hardware-specific implementations for a core set of routines.
This is just an idea, but the `Device` specialization for CPU-specific code could be developed in parallel to xtensor and, once mature enough, swapped in for the portions of code calling the synonymous routines.
The GPU specialization can be filled in later, and WIP routines can throw runtime or compile-time errors.
I'm not too familiar with the internals of xtensor so this might be an infeasible approach though.
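To make the idea concrete, here is a minimal sketch of such a device abstraction; the names are hypothetical and do not match ChainerX's or xtensor's actual APIs:

```cpp
#include <cstddef>
#include <cstdlib>
#include <stdexcept>

// Abstract device: memory management plus a core set of routines.
struct device {
    virtual ~device() = default;
    virtual void* allocate(std::size_t bytes) = 0;
    virtual void deallocate(void* p) = 0;
    // a real design would cover many more kernels than 'add'
    virtual void add(const float* a, const float* b,
                     float* out, std::size_t n) = 0;
};

// CPU specialization: can be developed in parallel to xtensor.
struct cpu_device final : device {
    void* allocate(std::size_t bytes) override { return std::malloc(bytes); }
    void deallocate(void* p) override { std::free(p); }
    void add(const float* a, const float* b,
             float* out, std::size_t n) override {
        for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
    }
};

// GPU specialization: WIP routines throw until a backend is plugged in.
struct gpu_device final : device {
    void* allocate(std::size_t) override { throw std::runtime_error("not implemented"); }
    void deallocate(void*) override { throw std::runtime_error("not implemented"); }
    void add(const float*, const float*, float*, std::size_t) override {
        throw std::runtime_error("not implemented");
    }
};
```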
Definitely a good idea to look at ChainerX!
I'm also interested in this feature, although I'm not sure how to do it! As a starting point, I'd point out the ArrayFire package: https://github.com/arrayfire/arrayfire. My loose understanding is that instead of loop fusion they perform kernel fusion.
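For reference, ArrayFire's element-wise operations build up a JIT graph rather than launching kernels eagerly; at evaluation time the whole graph is fused into a single kernel. A small example using ArrayFire's `af::array` API:

```cpp
#include <arrayfire.h>

int main() {
    af::array a = af::randu(1 << 20);
    af::array b = af::randu(1 << 20);
    // element-wise ops only record nodes in a JIT graph here
    af::array c = a * 2.0f + af::sin(b);
    c.eval();  // the graph is fused and executed as one kernel
}
```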
I think we should leave this still open as it isn't solved.
This would be very interesting. Any progress/news?
I don't know if you'd be interested in https://llvm.discourse.group/t/numpy-scipy-op-set/
This would be very interesting. Any progress/news?
Not yet. We are trying to get some funding to start it.
Any news on this subject?
Is there any description of how you'd envision supporting GPUs, in particular through SYCL?
Unfortunately, no. We don't have any funding for implementing this.
Why was this issue closed? xtensor does not yet have GPU support, does it?
I don't know why it was closed, but it should definitely stay opened until we can implement it.
Are there any updates on a timeline for when xtensor might have GPU support?
Hey, I'm a quant open to working on this. I have to research more about how the library works and how to map your containers to the GPU.
Which framework would be best for this? CUDA might not be, because you want the best performance on any GPU.
Maybe OpenCL could work, or something else.
Also, have you checked NVIDIA's implementation of the C++ parallel algorithms (std::execution::par) and the integrations with it? It would make everything easier, but I'm not sure if it will work with your library.
If you have std::vectors in the background, I'm pretty sure it will work.
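For reference, the NVIDIA route mentioned here is the C++17 parallel algorithms: when compiled with `nvc++ -stdpar=gpu`, standard algorithms over containers like `std::vector` are offloaded to the GPU, while other compilers simply run them on CPU threads. A minimal example:

```cpp
#include <algorithm>
#include <execution>
#include <functional>
#include <vector>

int main() {
    std::vector<float> a(1 << 20, 1.0f), b(1 << 20, 2.0f), out(1 << 20);
    // with nvc++ -stdpar=gpu this transform runs on the GPU;
    // elsewhere it falls back to a parallel CPU implementation
    std::transform(std::execution::par_unseq,
                   a.begin(), a.end(), b.begin(), out.begin(),
                   std::plus<float>{});
}
```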
@Physicworld I think the easiest / most portable solution would be to use SYCL.
@Physicworld SYCL can be used with multiple backends, with full or experimental support for NVIDIA, AMD, and Intel. I think SYCL (and CUDA) have partial if not complete GPU implementations of the std:: algorithms, so that might be some low-hanging fruit.
Given that SYCL can also run on a host/CPU backend, it would be ideal: all the xtensor calls could be refactored into SYCL, and then one implementation would work on the host or GPU side with only a runtime toggle (a sketch follows below).
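A minimal sketch of that runtime toggle using standard SYCL 2020 selectors (note the SYCL 1.2.1 host device was removed in SYCL 2020, so a CPU device plays the "host backend" role):

```cpp
#include <sycl/sycl.hpp>

// one code path, device chosen at runtime;
// gpu_selector_v throws if no GPU is available
sycl::queue make_queue(bool use_gpu) {
    return use_gpu ? sycl::queue{sycl::gpu_selector_v}
                   : sycl::queue{sycl::cpu_selector_v};
}
```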
Thanks for the great lib. Any progress/news?
Nope, we are still searching for funding to implement new features in xtensor.