x448 / float16

float16 provides IEEE 754 half-precision format (binary16) with correct conversions to/from float32
MIT License

Hardware acceleration #47

Open TailsFanLOL opened 5 months ago

TailsFanLOL commented 5 months ago

Hey! Can this use hardware instructions for conversion? Intel CPUs have supported hardware conversion since 2013, and the new 12th-gen parts also have support for half-precision arithmetic (I think?). Other architectures have had that for a while.

This might be possible without compiler support using embedded C code, but wouldn't that be out of scope for this?

TailsFanLOL commented 5 months ago

Nvm, 12th gen had this briefly, added by accident, but it was removed in later revisions. That probably means it will appear in other upcoming Core series CPUs. It was already available on Sapphire Rapids Xeons.

I also opened an upstream issue.

swdee commented 1 month ago

In my use case we have float16 tensor outputs from an NPU on the RK3588 (an Arm processor). Arm does have NEON SIMD instructions to hardware-accelerate the conversion from fp16 to fp32, but we can't make use of those extensions from Go, as the compiler does not support SIMD instructions.

Via CGO you can interface with the Arm Compute Library to make use of these instructions; however, for our use case, which involves converting 856,800 bytes from uint16->fp32 per video frame, this is much slower than sticking with pure Go in this library.

However, better performance is still attainable by using a precalculated lookup table for the uint16->fp32 conversion.

On the RK3588 we get a 35% performance improvement.

BenchmarkF16toF32NormalConversion-8 150 7872802 ns/op 1720348 B/op 1 allocs/op
BenchmarkF16toF32LookupConversion-8 218 5123550 ns/op 1720342 B/op 1 allocs/op

And on a Threadripper workstation we get a 69% improvement.

BenchmarkF16toF32NormalConversion-20 1302 916041 ns/op 1720322 B/op 1 allocs/op
BenchmarkF16toF32LookupConversion-20 3919 275437 ns/op 1720335 B/op 1 allocs/op

To create such a lookup table we simply precalculate it in our application with:

import "github.com/x448/float16"

var f16LookupTable [65536]float32

func init() {
    // precompute float16 lookup table for faster conversion to float32
    for i := range f16LookupTable {
        f16 := float16.Frombits(uint16(i))
        f16LookupTable[i] = f16.Float32()
    }
}

Then we convert our output buffer from uint16 to fp32 with:

func convertBufferToFloat32(float16Buf []uint16) []float32 {
    float32Buf := make([]float32, len(float16Buf))

    for i, val := range float16Buf {
        float32Buf[i] = f16LookupTable[val]
    }

    return float32Buf
}

x448 commented 3 weeks ago

@swdee Thanks for the suggestion! I will try this and see how it goes.

swdee commented 1 week ago

@x448 We now have a CGO version, developed together with @TailsFanLOL and discussed here.