Open TailsFanLOL opened 5 months ago
Nvm, 12th gen had this for a brief moment (added by accident), but it was then removed from later revisions. This probably means it is going to be in other upcoming Core series CPUs. It was already a thing for Sapphire Rapids Xeons.
I also opened an upstream issue.
In my use case we have float16 tensor outputs from an NPU on the RK3588 (an Arm processor). Arm does have NEON SIMD instructions that hardware-accelerate the conversion from fp16 to fp32, but we can't make use of them from Go, as the compiler does not support SIMD instructions.
Via CGO you can interface with the ARM Compute Library to make use of these instructions. However, for our use case, which involves converting 856,800 bytes from uint16->fp32 per video frame, this is much slower than sticking with pure Go in this library.
Better performance is still attainable, though, by using a precalculated lookup table for the uint16->fp32 conversion.
On the RK3588 we get a 35% reduction in time per operation:
BenchmarkF16toF32NormalConversion-8 150 7872802 ns/op 1720348 B/op 1 allocs/op
BenchmarkF16toF32LookupConversion-8 218 5123550 ns/op 1720342 B/op 1 allocs/op
And on a Threadripper workstation we get a reduction of roughly 70%:
BenchmarkF16toF32NormalConversion-20 1302 916041 ns/op 1720322 B/op 1 allocs/op
BenchmarkF16toF32LookupConversion-20 3919 275437 ns/op 1720335 B/op 1 allocs/op
To create such a lookup table, we simply precalculate it in our application:
import "github.com/x448/float16"

var f16LookupTable [65536]float32

func init() {
	// precompute float16 lookup table for faster conversion to float32
	for i := range f16LookupTable {
		f16 := float16.Frombits(uint16(i))
		f16LookupTable[i] = f16.Float32()
	}
}
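For anyone wondering what each of the 65,536 table entries actually holds, here is a rough, dependency-free sketch of the binary16 to float32 expansion. Note that `halfBitsToFloat32` and `buildHalfTable` are hypothetical names made up for this illustration; they are not part of x448/float16, and I have not benchmarked this against the library's own conversion.

```go
package main

import "math"

// halfBitsToFloat32 expands an IEEE 754 binary16 bit pattern to float32.
// Hypothetical helper for illustration only; not the x448/float16 API.
func halfBitsToFloat32(h uint16) float32 {
	sign := uint32(h&0x8000) << 16 // sign bit moves from bit 15 to bit 31
	exp := uint32(h>>10) & 0x1f    // 5-bit exponent, bias 15
	frac := uint32(h & 0x03ff)     // 10-bit fraction

	switch {
	case exp == 0 && frac == 0:
		// Signed zero.
		return math.Float32frombits(sign)
	case exp == 0:
		// Subnormal half: shift the fraction up until the implicit
		// leading bit appears, lowering the exponent for each shift.
		e := uint32(127 - 15 + 1)
		for frac&0x0400 == 0 {
			frac <<= 1
			e--
		}
		frac &= 0x03ff // drop the now-implicit leading bit
		return math.Float32frombits(sign | e<<23 | frac<<13)
	case exp == 0x1f:
		// Infinity (frac == 0) or NaN (frac != 0); payload bits are kept.
		return math.Float32frombits(sign | 0x7f800000 | frac<<13)
	default:
		// Normal number: rebias the exponent from 15 to 127.
		return math.Float32frombits(sign | (exp+127-15)<<23 | frac<<13)
	}
}

// buildHalfTable precomputes the conversion for every possible uint16 value.
func buildHalfTable() *[65536]float32 {
	var t [65536]float32
	for i := range t {
		t[i] = halfBitsToFloat32(uint16(i))
	}
	return &t
}
```

The table costs 256 KiB (65,536 entries of 4 bytes each), so whether it beats the branchy conversion on a given machine probably depends on cache size and on how scattered the input values are.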
Then we convert our output buffer from uint16 to fp32 with:
func convertBufferToFloat32(float16Buf []uint16) []float32 {
	float32Buf := make([]float32, len(float16Buf))
	for i, val := range float16Buf {
		float32Buf[i] = f16LookupTable[val]
	}
	return float32Buf
}
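The benchmarks above report one allocation of about 1.7 MB per call, which is the output slice. If the conversion runs once per video frame, a variant that reuses a caller-supplied buffer might be worth trying. This is only a sketch: `convertInto` is a hypothetical name, the table is passed in as a parameter so the snippet stands alone, and I have not measured whether it helps on the RK3588.

```go
package main

// convertInto converts float16 bit patterns to float32 using a precomputed
// table, writing into dst when it has enough capacity instead of allocating
// a new slice on every call. Hypothetical helper for illustration only.
func convertInto(dst []float32, src []uint16, table *[65536]float32) []float32 {
	if cap(dst) < len(src) {
		// Only allocate when the supplied buffer is too small.
		dst = make([]float32, len(src))
	}
	dst = dst[:len(src)]
	for i, v := range src {
		// A uint16 index can never exceed 65535, so it is always in range.
		dst[i] = table[v]
	}
	return dst
}
```

The returned slice shares memory with `dst`, so the caller would keep it and pass it back in for the next frame, e.g. `buf = convertInto(buf, frame, &f16LookupTable)`. The previous frame's values are overwritten, so anything that needs to outlive the next call has to be copied first.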
@swdee Thanks for the suggestion! I will try this and see how it goes.
@x448 We have a CGO version, which we worked on with @TailsFanLOL and discussed here.
Hey! Can this use hardware instructions for conversion? Intel CPUs have supported hardware conversion since 2013, and the new 12th gen also has support for arithmetic (I think?). Other architectures have had that for a while.
This might be possible without compiler support using embedded C code, but wouldn't that be out of scope for this?