xtensor-stack / xtensor

C++ tensors with broadcasting and lazy computing
BSD 3-Clause "New" or "Revised" License

GPU support #192

Open bhack opened 7 years ago

bhack commented 7 years ago

I want to start this topic just to discuss how GPU support could be introduced in the library.

/cc @randl @edgarriba

bhack commented 7 years ago

See also section 3.4 in https://github.com/kokkos/array_ref/blob/master/proposals/P0331.rst

feliwir commented 7 years ago

Would be interesting, indeed

SylvainCorlay commented 7 years ago

Note that we now have strong SIMD support for broadcasting and a number of other use cases, based on the xsimd project.

feliwir commented 7 years ago

@SylvainCorlay SIMD is good and all, but its performance is not comparable to GPU acceleration. Especially for deep learning applications, it makes a lot of sense to use this kind of acceleration. Since this library orients its API on NumPy, I recommend looking at PyTorch, which has similar goals but uses GPU acceleration.

SylvainCorlay commented 7 years ago

GPU support is in scope and on the roadmap. I meant that the work done for SIMD actually paved the way, since a lot of the required logic in xtensor is the same.

Note that frameworks like PyTorch don't implement the compile-time loop unfolding that xtensor does, which can make xtensor faster on complex expressions.
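
For illustration, a minimal sketch of the expression machinery being described (ordinary xtensor usage, nothing GPU-related yet):

```cpp
#include <xtensor/xarray.hpp>

int main()
{
    xt::xarray<double> a = {{1., 2., 3.}, {4., 5., 6.}};
    xt::xarray<double> b = {10., 20., 30.};

    // Lazy expression template: no temporaries are allocated here,
    // and b is broadcast against a's shape.
    auto expr = a + b * 2.;

    // The whole expression is evaluated in a single fused pass on
    // assignment; with XTENSOR_USE_XSIMD defined, that loop is
    // vectorized via xsimd.
    xt::xarray<double> c = expr;
    return 0;
}
```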

feliwir commented 7 years ago

Awesome, thanks for letting me know. How are you planning to implement it? Are contributions possible?

AuroraDysis commented 6 years ago

I recommend using MAGMA as the backend to support both GPU and CPU.

ktnyt commented 5 years ago

Just wondering if there are any updates on the topic. I'd love to make contributions where possible!

wolfv commented 5 years ago

We have GPU support on our roadmap for 2019. However, we're not yet sure how to do it concretely! So any input is highly appreciated. And of course, contributions are very welcome!

The thing we'd probably like to start out with is mapping a container to the GPU, and evaluating a simple binary function, such as A + B.
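
As a rough sketch of what that first milestone could look like in plain SYCL (purely illustrative; this is not xtensor code):

```cpp
#include <sycl/sycl.hpp>
#include <vector>

int main()
{
    std::vector<float> a(1024, 1.f), b(1024, 2.f), c(1024);
    sycl::queue q;  // default device: a GPU if one is available

    {
        // Map the host containers to device-visible buffers.
        sycl::buffer<float> ba(a.data(), sycl::range<1>{a.size()});
        sycl::buffer<float> bb(b.data(), sycl::range<1>{b.size()});
        sycl::buffer<float> bc(c.data(), sycl::range<1>{c.size()});

        // Evaluate the simple binary expression c = a + b on the device.
        q.submit([&](sycl::handler& h) {
            sycl::accessor va(ba, h, sycl::read_only);
            sycl::accessor vb(bb, h, sycl::read_only);
            sycl::accessor vc(bc, h, sycl::write_only);
            h.parallel_for(sycl::range<1>{a.size()}, [=](sycl::id<1> i) {
                vc[i] = va[i] + vb[i];
            });
        });
    }  // buffers go out of scope here and write back to the vectors

    return 0;
}
```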

ktnyt commented 5 years ago

Thanks for the prompt reply!

I haven't been able to make a deep dive into the code yet, but I was thinking that the implementation strategy of a recently released library called ChainerX (https://github.com/chainer/chainer/tree/master/chainerx_cc) might be of help. It basically provides device-agnostic, NumPy-like multi-dimensional arrays for C++. AFAIK they provide a Device abstract class that handles memory management and hides the hardware-specific implementations for a core set of routines. This is just an idea, but the Device specialization for CPU-specific code could be developed in parallel to xtensor and, when it is mature enough, switched in for the portions of code calling the synonymous routines. The GPU specialization can be filled in later, and WIP routines can throw runtime or compile-time errors.

I'm not too familiar with the internals of xtensor, so this might be an infeasible approach though.
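
A minimal sketch of the ChainerX-style abstraction described above (all names are hypothetical; nothing like this exists in xtensor):

```cpp
#include <cstddef>

// Hypothetical device abstraction in the ChainerX style.
class Device
{
public:
    virtual ~Device() = default;

    // Memory management is hidden behind the device.
    virtual void* allocate(std::size_t bytes) = 0;
    virtual void deallocate(void* ptr) = 0;

    // One entry per core routine; each backend specializes these.
    virtual void add(const float* a, const float* b,
                     float* out, std::size_t n) = 0;
};

// CPU specialization: could be developed in parallel to xtensor and
// swapped in once mature.
class CpuDevice final : public Device
{
public:
    void* allocate(std::size_t bytes) override { return ::operator new(bytes); }
    void deallocate(void* ptr) override { ::operator delete(ptr); }

    void add(const float* a, const float* b,
             float* out, std::size_t n) override
    {
        for (std::size_t i = 0; i < n; ++i)
        {
            out[i] = a[i] + b[i];
        }
    }
};

// A GpuDevice specialization would be filled in later; routines that
// are still WIP could throw at runtime until implemented.
```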

wolfv commented 5 years ago

Definitely a good idea to look at ChainerX!

miketoastmacneil commented 5 years ago

I'm also interested in this feature, although not sure how to do it! As a starting point, I would also point out the ArrayFire package: https://github.com/arrayfire/arrayfire. My loose understanding is that instead of loop fusion they perform kernel fusion.
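
For reference, ArrayFire's user-facing side looks like ordinary array expressions: its JIT records element-wise operations and fuses them into a single kernel that is only compiled and launched when a result is forced. Roughly:

```cpp
#include <arrayfire.h>

int main()
{
    af::array a = af::randu(1024);
    af::array b = af::randu(1024);

    // No kernel is launched yet; the JIT just records the operations.
    af::array c = a * 2.0f + b;

    // eval() fuses the recorded graph into one kernel and runs it on
    // the active backend (CUDA, OpenCL, or CPU).
    c.eval();
    return 0;
}
```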

wolfv commented 4 years ago

I think we should leave this open, as it isn't solved.

fschlimb commented 4 years ago

This would be very interesting. Any progress/news?

bhack commented 4 years ago

I don't know if you might be interested in https://llvm.discourse.group/t/numpy-scipy-op-set/

JohanMabille commented 4 years ago

> This would be very interesting. Any progress/news?

Not yet. We are trying to get some funding to start it.

fschlimb commented 2 years ago

> This would be very interesting. Any progress/news?
>
> Not yet. We are trying to get some funding to start it.

Any news on this subject?

fschlimb commented 2 years ago

Is there any description of how you'd envision supporting GPUs, in particular through SYCL?

JohanMabille commented 2 years ago

Unfortunately no. We don't have any funding for implementing this.

antoniojkim commented 2 years ago

Why was this issue closed? xtensor does not yet have GPU support, does it?

JohanMabille commented 2 years ago

I don't know why it was closed, but it should definitely stay open until we can implement it.

antoniojkim commented 2 years ago

Are there any updates on a timeline for when xtensor might have GPU support?

Physicworld commented 2 years ago

Hey, I'm a quant open to working on this; I'd have to research more about how the library works and how to map your containers to the GPU.

Which framework is best for this? CUDA might not be it, because you want the best performance on any GPU; maybe OpenCL or something else could work.

Also, have you checked NVIDIA's implementations of std::par and the integrations around it? That would make everything easier, though I'm not sure it will work with your library. If you have std::vectors in the background, I'm pretty sure it will.
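
For context, "std::par" here refers to the C++17 parallel algorithms; NVIDIA's nvc++ can offload them to the GPU via its -stdpar mode, where std::vector works because heap allocations are placed in unified memory. A minimal sketch, assuming such a compiler:

```cpp
#include <algorithm>
#include <execution>
#include <vector>

int main()
{
    std::vector<float> a(1 << 20, 1.f), b(1 << 20, 2.f), c(1 << 20);

    // Standard C++17 parallel algorithm: with `nvc++ -stdpar` this
    // transform is offloaded to the GPU; with other compilers it runs
    // as a parallel loop on the CPU.
    std::transform(std::execution::par_unseq,
                   a.begin(), a.end(), b.begin(), c.begin(),
                   [](float x, float y) { return x + y; });
    return 0;
}
```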

feliwir commented 2 years ago

@Physicworld I think the easiest / most portable solution would be to use SYCL.

spectre-ns commented 2 years ago

@Physicworld SYCL can be used with multiple backends, with full or experimental support for NVIDIA, AMD, and Intel. I think SYCL (and CUDA) have partial if not complete GPU implementations of the std:: algorithms, so that might be some low-hanging fruit.

spectre-ns commented 2 years ago

Given that SYCL can run on its host backend, it would be ideal: all the xtensor calls could be refactored into SYCL, and one implementation would then work on the host or GPU side with only a runtime toggle.
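
A minimal sketch of that runtime toggle (note that SYCL 2020 replaced the dedicated host device with ordinary CPU devices, so the toggle here selects between CPU and GPU queues):

```cpp
#include <sycl/sycl.hpp>
#include <iostream>
#include <string>

int main(int argc, char** argv)
{
    // Same kernels, same code path: only the queue changes at runtime.
    bool use_gpu = (argc > 1 && std::string(argv[1]) == "--gpu");
    sycl::queue q = use_gpu ? sycl::queue{sycl::gpu_selector_v}
                            : sycl::queue{sycl::cpu_selector_v};

    std::cout << "Running on: "
              << q.get_device().get_info<sycl::info::device::name>()
              << std::endl;
    return 0;
}
```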

spectre-ns commented 2 years ago

https://github.com/oneapi-src/oneAPI-samples

ksvbka commented 1 year ago

Thanks for the great lib. Any progress/news?

JohanMabille commented 1 year ago

Nope, we are still searching for funding to implement new features in xtensor.