wdmapp / gtensor

GTensor is a multi-dimensional array C++14 header-only library for hybrid GPU development.
BSD 3-Clause "New" or "Revised" License

half precision axpy #271

Open cmpfeil opened 1 year ago

cmpfeil commented 1 year ago

As of now, using half precision is not straightforward, not only for extension libraries (as mentioned in issue #266), but also for generating standard kernels. For example, using half (from the <cuda_fp16.h> header) instead of double in gtensor/examples/src/daxpy.cxx does not compile.

Here is a minimal demo; the compilation errors are noted as comments in the code.

#include <iostream>                                                                                                        
#include <gtensor/gtensor.h>

#include <cuda_fp16.h>
using storage_type = half; // fails to compile with ERRORS#1#2#3
//using storage_type = double; // works, output "y = { 5.7 5.7 5.7 5.7 5.7 }"

using namespace gt::placeholders;

int main(int argc, char** argv)
{
    const storage_type a = 0.5, x0 = 3.0, y0 = 4.2;
    gt::gtensor<storage_type, 1, gt::space::device> x(gt::shape(5), x0); // ERROR#1: ambiguous conversion from half,
    gt::gtensor<storage_type, 1, gt::space::device> y(gt::shape(5), y0); //          but only if init x0, y0 passed

    y = gt::eval(a * x + y); // ERROR#2(a-e)
    // a,b) ambiguous conversion from "storage_type" [gt:operator*]
    // c) no instance of function template "std::declval" matches argument list
    // d) ambiguous conversion from "storage_type" [gt::operator+]
    // e) more than one instance of overloaded function "gt::eval" matches the argument list

    std::cout << "y = " << y.view(_all) << std::endl; // ERROR#3: ambiguous operator <<
    return 0;
}

The full compilation command and detailed error messages for the demo code are attached: error_msg_half.txt.

Any input is welcome!

bd4 commented 1 year ago

I am curious whether it works with the EXPLICIT_KERNEL version, which uses gt::launch<1> instead of array expressions. The errors are not obvious, but my first guesses are that the cuda_fp16.h header needs to be included earlier (before gtensor), or that a non-half value gets mixed in somewhere; since mixed operators are not defined, the compiler tries to do a conversion and finds too many options.

The way I would recommend going about this, if you want to work on it, is to create a new test file in tests/test_half.cxx and start with a super simple host-only test case using gt::launch_host<1>. Once you have that working, gradually add more complex tests and get those working. Eventually we want this to be backend independent, so there needs to be a gt::half type that e.g. maps to sycl::half on the Intel SYCL backend. Note that half is optional in SYCL, so we may want an option for turning half support on or off in gtensor.
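
Something along these lines, maybe (untested sketch; it assumes a gt::half type already exists and that the test uses gtest like the existing test files, so adjust as needed):

#include <gtest/gtest.h>
#include <gtensor/gtensor.h>

TEST(half, host_axpy_1d)
{
    const int N = 5;
    gt::gtensor<gt::half, 1, gt::space::host> x(gt::shape(N));
    gt::gtensor<gt::half, 1, gt::space::host> y(gt::shape(N));
    gt::half a = 0.5;

    auto k_x = x.to_kernel();
    auto k_y = y.to_kernel();

    gt::launch_host<1>(x.shape(), GT_LAMBDA(int i) { k_x(i) = 2.0; });
    gt::launch_host<1>(y.shape(), GT_LAMBDA(int i) { k_y(i) = 1.0; });
    gt::launch_host<1>(y.shape(),
        GT_LAMBDA(int i) { k_y(i) = a * k_x(i) + k_y(i); });

    // 0.5 * 2.0 + 1.0 == 2.0 is exactly representable in half
    for (int i = 0; i < N; i++)
        EXPECT_EQ(static_cast<float>(y(i)), 2.0f);
}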

I can take a closer look if you are stuck. I suspect the errors are going to end up being very backend specific and require some tweaks to get working on each.

bd4 commented 1 year ago

Another way we could go about this: I could add sycl::half support for the Intel backend to provide the basic test/type structure, and you could add the CUDA and/or AMD backends. Let me know what you prefer - I don't want to spoil your fun if you want to dig in more deeply right away, but I also don't want to make it too difficult to get started.

cmpfeil commented 1 year ago

Thanks for the quick reply!

Following your input I can report the following:

I will dig into it and try to get a cuda version running, but will consult you when I get stuck - which might be soon^^

Two more notes, which could be of interest:

bd4 commented 1 year ago

Oh, I missed that; I would try building without GTENSOR_USE_THRUST set. In general it may be safer to build with cmake. Looking at your log I think you have all the relevant options, but cmake is what is used most of the time, so there is more chance of the old Makefile example getting out of date. For example, with cmake the thrust backend is no longer enabled by default.

Note that thrust is still used for some things like the complex type and reductions, even when GTENSOR_USE_THRUST is not set. The name is a bit misleading - it really refers to the storage backend, i.e. whether thrust device_vector is used or the custom gtensor_storage type is used.

Re std::is_floating_point, this is one approach: https://github.com/argonne-lcf/SyclCPLX/blob/main/include/sycl_ext_complex.hpp#L406 just replacing sycl::half with gt::half (which would be an alias for CUDA half, SYCL half, or HIP half, depending on the backend).
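
In gtensor terms that could look roughly like this (sketch only; gt::half and this trait don't exist yet):

#include <type_traits>

namespace gt
{

// same as std::is_floating_point, except it is also true for the half type
template <typename T>
struct is_floating_point : std::is_floating_point<T>
{};

template <>
struct is_floating_point<gt::half> : std::true_type
{};

} // namespace gt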

bd4 commented 1 year ago

If you end up going down the route of adding some tests, feel free to submit a draft PR, and I can see what happens with sycl::half to try to see cross-backend issues sooner rather than later.

germasch commented 1 year ago

FWIW, I tried daxpy with gcc's _Float16, and hit only two issues:

Beyond that, the daxpy example seems to work fine, at least CPU-only. I guess the next step is to take a look at what happens with CUDA.

cmpfeil commented 1 year ago

Thanks for all the input.

Update from my side: With only a slight modification, I got the CUDA explicit kernel version of axpy running (without GTENSOR_USE_THRUST).

#include <iostream>
#include <gtensor/gtensor.h>

#include <cuda_fp16.h>
using storage_type = half;

int main(int argc, char** argv)
{
    int N = 5;
    gt::gtensor<storage_type, 1, gt::space::device> d_x(gt::shape(N));
    gt::gtensor<storage_type, 1, gt::space::device> d_y(gt::shape(N));
    storage_type a = 0.5;

    auto k_x = d_x.to_kernel();                                                                                            
    auto k_y = d_y.to_kernel();

    gt::launch<1, gt::space::device>(d_x.shape(), [=] __device__ (int i) { k_x(i) = 3.0; }); 
    gt::launch<1, gt::space::device>(d_x.shape(), [=] __device__ (int i) { k_y(i) = 4.2; }); 

    gt::launch<1, gt::space::device>(d_x.shape(), 
        [=] __device__ (int i) { k_y(i) = a * k_x(i) + k_y(i); }); 

    gt::gtensor<storage_type, 1, gt::space::host> h_y(gt::shape(N));
    gt::copy(d_y, h_y);

    for(int j = 0; j < N; ++j)
        std::cout << "y[" << j << "] = " << (double) h_y[j] << std::endl; 

    return 0;
}

Apart from the minor work-arounds for initialization and printing, there are three things to note:

  1. GT_LAMBDA is defined as __host__ __device__, but cuda_fp16.h{pp} defines operators like operator+(half, half) only for __device__; so the kernel lambda needs to be __device__-only as well.
  2. The example also works with #include <cuda_bf16.h> and using storage_type = nv_bfloat16;
  3. To have operators, casts, etc. for half available, one needs __CUDA_ARCH__ >= 530 (>= 800 for nv_bfloat16); i.e., on an A100 I additionally needed to pass the nvcc flag --generate-code arch=compute_80,code=sm_80.

Again, I'm happy for any feedback!

I will only have time to continue on it late next week. Do you think the next step should be to get the implicit kernel running, or does something else make more sense?

bd4 commented 1 year ago

Making GT_LAMBDA device-only is a major problem; it would break all host launches. It's not obvious to me how to work around that limitation - I wonder if there is a higher-level library out there, built on top of CUDA, that provides a friendlier half type? One ugly possibility would be to define separate GT_LAMBDA_HOST and GT_LAMBDA_DEVICE macros and avoid use of half types in the _HOST one. But that is a very ugly breaking change.
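
For illustration, the split (which I'd rather avoid) would be roughly:

// hypothetical split; GT_LAMBDA is currently [=] __host__ __device__
#define GT_LAMBDA_HOST [=]
#define GT_LAMBDA_DEVICE [=] __device__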

Other than finding a solution to that, I think the next step is to define a gt::half type (possibly just an alias for a backend specific type) and some basic tests for it.
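
To be concrete, I am imagining something roughly like the following (backend macro names written from memory, so treat it as pseudocode):

// gtensor/half.h sketch: gt::half as an alias for a backend-specific type
#if defined(GTENSOR_DEVICE_CUDA)
#include <cuda_fp16.h>
namespace gt { using half = __half; }
#elif defined(GTENSOR_DEVICE_SYCL)
#include <sycl/sycl.hpp>
namespace gt { using half = sycl::half; }
#else
namespace gt { using half = _Float16; } // host backend, gcc/clang extension
#endif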

bd4 commented 1 year ago

I wonder if implicit conversion operators from backend specific device type to an appropriate compiler specific host type would help here? Like half -> _Float16 when using CUDA + g++.
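
To illustrate the idea (rough sketch only, names made up):

#include <cuda_fp16.h>

// store the device type, but only ever convert out of it in host code
struct half_wrap
{
    __half data;

    // __half2float has a host fallback, so this works on the host too
    __host__ __device__ operator float() const { return __half2float(data); }

    // with CUDA + g++ one could additionally provide something like:
    // operator _Float16() const { return static_cast<_Float16>(__half2float(data)); }
};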

bd4 commented 1 year ago

I think it might make sense for me to try this out with sycl::half and _Float16, for the sycl and host backends respectively. I suspect those will be easy, and then when you have some time to work on this again, the framework for integrating CUDA/HIP support will already exist (in a branch).

bd4 commented 1 year ago

This might be worth trying? https://forums.developer.nvidia.com/t/error-when-trying-to-use-half-fp16/39786/10

cmpfeil commented 1 year ago

> Making GT_LAMBDA device-only is a major problem; it would break all host launches. It's not obvious to me how to work around that limitation - I wonder if there is a higher-level library out there, built on top of CUDA, that provides a friendlier half type? One ugly possibility would be to define separate GT_LAMBDA_HOST and GT_LAMBDA_DEVICE macros and avoid use of half types in the _HOST one. But that is a very ugly breaking change.
>
> Other than finding a solution to that, I think the next step is to define a gt::half type (possibly just an alias for a backend specific type) and some basic tests for it.

I think an appropriate design of the gt::half type will automatically cover this problem and keep GT_LAMBDA unchanged. In the axpy code below, note that the custom type HalfWrapper makes it possible to use GT_LAMBDA rather than [=] __device__.

#include <iostream>
#include <gtensor/gtensor.h>

#include "half_wrapper.hxx"
using storage_type = HalfWrapper;

int main(int argc, char** argv)
{
    int N = 5;
    gt::gtensor<storage_type, 1, gt::space::device> d_x(gt::shape(N));
    gt::gtensor<storage_type, 1, gt::space::device> d_y(gt::shape(N));
    storage_type a = 0.5;

    auto k_x = d_x.to_kernel();
    auto k_y = d_y.to_kernel();

    gt::launch<1, gt::space::device>(d_x.shape(), GT_LAMBDA(int i) { k_x(i) = 3.0; }); 
    gt::launch<1, gt::space::device>(d_x.shape(), GT_LAMBDA(int i) { k_y(i) = 4.2; }); 

    gt::launch<1, gt::space::device>(d_x.shape(), 
        GT_LAMBDA(int i) { k_y(i) = a * k_x(i) + k_y(i); }); 

    gt::gtensor<storage_type, 1, gt::space::host> h_y(gt::shape(N));
    gt::copy(d_y, h_y);

    for(int j = 0; j < N; ++j)
        std::cout << "y[" << j << "] = " << h_y[j] << std::endl; 

    return 0;
}

Here HalfWrapper is a prototype for gt::half that uses half as its storage type. On the CUDA device, computations are performed in half, while on the host, calculations simply fall back to a compute precision the host understands (e.g. float).

// half_wrapper.hxx
#ifndef HALF_WRAPPER
#define HALF_WRAPPER

#include <iostream>
#include <cuda_fp16.h>

#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 530)
#define TARGET_ARCH __host__ __device__
#else
#define TARGET_ARCH
#endif

class HalfWrapper
{
public:
    TARGET_ARCH HalfWrapper(float x) : x(x) {}; 
    TARGET_ARCH HalfWrapper(half x) : x(x) {}; 
    TARGET_ARCH const HalfWrapper& operator=(const float f) { x = f; return *this; }
    TARGET_ARCH const half& Get() const { return x; }
private:
    half x;
};

TARGET_ARCH inline HalfWrapper operator+(const HalfWrapper& lhs, const HalfWrapper& rhs)
{
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 530)
    return HalfWrapper(lhs.Get() + rhs.Get());
#else
    // half arithmetic is not available on the host, so compute in float
    return HalfWrapper(float(lhs.Get()) + float(rhs.Get()));
#endif
}

TARGET_ARCH inline HalfWrapper operator*(const HalfWrapper& lhs, const HalfWrapper& rhs)
{
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 530)
    return HalfWrapper(lhs.Get() * rhs.Get());
#else
    return HalfWrapper(float(lhs.Get()) * float(rhs.Get()));
#endif
}

// inline so the header can be included from multiple translation units
inline std::ostream& operator<<(std::ostream& s, const HalfWrapper& h)
{
    s << (float)h.Get();
    return s;
}

#endif

Of course different roundings will lead to slightly different results on the host, but just falling back to float definitely requires the least implementation effort for now.

What do you think?

bd4 commented 1 year ago

I think that is a great place to start! Trying to detect whether the host compiler has support and use it when available, e.g. gcc's _Float16, can be a later enhancement (and in particular I think it is not important for the way GENE uses gtensor). Do you feel comfortable adding a header in include/gtensor/half.h (or maybe fp.h to indicate it may support e.g. fp8 too in the future), and a simple test in tests/test_half.cxx, and opening a draft PR? It might be worth creating a branch in the main repo for this too, so we can collaborate more easily versus using one of our forks.

We will probably also need a new cmake option, GTENSOR_ENABLE_FP16 or similar, that disables test_half when not set and does not include the half.h header in gtensor.h. But that can be added later too. One advantage of this is that we can make it OFF by default and get more experimental stuff mainlined sooner, without worrying about breaking some combo of CUDA/HIP version, driver, and other messiness.
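
On the header side that could be as simple as (sketch):

// in gtensor.h
#ifdef GTENSOR_ENABLE_FP16
#include "half.h"
#endif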

My understanding for SYCL is that sycl::half must be available, but from talking with @TApplencourt, it is UB to perform operations on it on the host. This is the relevant part of the spec: https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:opencl:extension-fp16

Note that this does happen to work with Intel oneAPI icpx, at least with 2023.1.0 on the Gen11 iGPU and the discrete GPU I tested:

#include <iostream>

#include <sycl/sycl.hpp>

int main(int argc, char **argv) {
  sycl::half a, b, c;
  a = 1;
  b = 1;
  c = 0;
  c = a + b;
  std::cout << "c = " << c << std::endl;
}

cmpfeil commented 1 year ago

Sure - I will follow the steps you suggested and open a draft PR. Got 1-2 busy weeks ahead, but after that I'm all on it.

cmpfeil commented 1 year ago

Due to holidays everything took a bit longer than expected, but now it's there: draft PR #276.