x448 / float16

float16 provides IEEE 754 half-precision format (binary16) with correct conversions to/from float32

MIT License

66 stars, 8 forks

Support for bfloat16 #22

Open tisnik opened 4 years ago

tisnik commented 4 years ago

Thank you for making this very useful and well-tested library! Are you planning to add support for the bfloat16 format, which is used in the ML field? It has different bit widths for the mantissa and exponent, but the other rules are the same as in the IEEE 754 formats.
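
For reference, bfloat16 keeps the sign bit and 8-bit exponent of float32 but only 7 explicit fraction bits, whereas binary16 has 5 exponent bits and 10 fraction bits. A minimal Go sketch of splitting a raw bfloat16 bit pattern into those fields follows; the helper name `splitBFloat16` is hypothetical and not part of x448/float16.

```go
package main

import "fmt"

// splitBFloat16 is a hypothetical helper (not part of x448/float16).
// It splits a raw bfloat16 bit pattern into its three fields:
// 1 sign bit, 8 exponent bits (bias 127, as in float32), 7 fraction bits.
func splitBFloat16(bits uint16) (sign, exponent, fraction uint16) {
	sign = bits >> 15
	exponent = (bits >> 7) & 0xFF
	fraction = bits & 0x7F
	return sign, exponent, fraction
}

func main() {
	// 0x3F80 encodes 1.0: sign 0, biased exponent 127, fraction 0.
	s, e, f := splitBFloat16(0x3F80)
	fmt.Println(s, e, f)

	// 0xC049 encodes -3.140625: sign 1, biased exponent 128, fraction 0x49.
	s, e, f = splitBFloat16(0xC049)
	fmt.Println(s, e, f)
}
```

Because the exponent field matches float32, the dynamic range is the same; only precision is reduced.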

x448 commented 4 years ago

Hi Pavel, I took a quick glance at bfloat16. If I implement it, I think it would be in a separate project.

There would have to be a convenient way for me to compare results with a hardware implementation. I'd like to be able to confirm 100% of float32<-->bfloat16 conversions.
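
Without a hardware reference, one partial check that can still be run exhaustively is the round trip bfloat16 -> float32 -> bfloat16 over all 65,536 bit patterns. The sketch below assumes simple shift-based conversions; the helper names are hypothetical and not part of x448/float16. It does not validate how float32 values are rounded on the way down, which is the part that needs a trusted reference.

```go
package main

import (
	"fmt"
	"math"
)

// toFloat32 and fromFloat32Truncate are local, hypothetical helpers based on
// plain 16-bit shifts; they are not part of x448/float16.
func toFloat32(b uint16) float32 {
	return math.Float32frombits(uint32(b) << 16)
}

func fromFloat32Truncate(f float32) uint16 {
	return uint16(math.Float32bits(f) >> 16)
}

// countRoundTripMismatches walks all 65,536 bfloat16 bit patterns and counts
// those that change after bfloat16 -> float32 -> bfloat16.
func countRoundTripMismatches() int {
	mismatches := 0
	for i := 0; i <= math.MaxUint16; i++ {
		b := uint16(i)
		if fromFloat32Truncate(toFloat32(b)) != b {
			mismatches++
		}
	}
	return mismatches
}

func main() {
	fmt.Println("round-trip mismatches:", countRoundTripMismatches())
}
```

The same loop shape extends to all 2^32 float32 inputs once there is a reference result to compare against.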

float16 was very convenient because the VM I use for coding had hardware instructions (F16C aka FP16C).

tisnik commented 4 years ago

Thank you for the quick response. Yes, it totally makes sense to create bfloat16 as a separate project (though IMHO most of the code will be very similar). As far as I know, bfloat16 is supported in AVX-512 via the VCVTNE2PS2BF16, VCVTNEPS2BF16 and VDPBF16PS instructions, but I have not tried them (and it very probably won't be possible to use the Go assembler with those pretty new instructions). I planned to create a conversion library myself, but I'm not sure how to handle special cases like denormalized values, sNaNs, qNaNs etc., as some bfloat16 implementations don't follow all IEEE 754 rules.
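
To make the special cases concrete, here is a minimal Go sketch that classifies a raw bfloat16 bit pattern. It assumes the IEEE 754-style reading of the fields (all-ones exponent for infinity and NaN, the top fraction bit as the quiet bit, zero exponent for zero and subnormals). The name `classify` is hypothetical, and an implementation that flushes subnormals or treats NaNs differently would not match it.

```go
package main

import "fmt"

// classify is a hypothetical helper that names the IEEE 754-style class of a
// raw bfloat16 bit pattern. It assumes the usual convention that the most
// significant fraction bit distinguishes quiet NaNs from signaling NaNs.
func classify(bits uint16) string {
	exponent := (bits >> 7) & 0xFF
	fraction := bits & 0x7F
	switch {
	case exponent == 0xFF && fraction == 0:
		return "infinity"
	case exponent == 0xFF && fraction&0x40 != 0:
		return "quiet NaN"
	case exponent == 0xFF:
		return "signaling NaN"
	case exponent == 0 && fraction == 0:
		return "zero"
	case exponent == 0:
		return "subnormal"
	default:
		return "normal"
	}
}

func main() {
	for _, b := range []uint16{0x0000, 0x0001, 0x3F80, 0x7F80, 0x7FC0, 0x7F81} {
		fmt.Printf("0x%04X: %s\n", b, classify(b))
	}
}
```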

agj32mrgibbits commented 2 years ago

Looks to me like conversion between bfloat16 and float32 is a simple and fast shift:

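import (
    "math"
    "strconv"
)

// BFloat16 stores the upper 16 bits of a float32 bit pattern.
type BFloat16 uint16

// ToFloat32 widens x exactly by restoring the 16 low bits as zeros.
func ToFloat32(x BFloat16) float32 {
    return math.Float32frombits(uint32(x) << 16)
}

// FromFloat32 drops the 16 low bits of x, i.e. it truncates (rounds toward zero).
func FromFloat32(x float32) BFloat16 {
    return BFloat16(math.Float32bits(x) >> 16)
}

// FromBits reinterprets a raw uint16 bit pattern as a BFloat16.
func FromBits(u16 uint16) BFloat16 {
    return BFloat16(u16)
}

// Bits returns the raw uint16 bit pattern of f.
func Bits(f BFloat16) uint16 {
    return uint16(f)
}

// String formats f with the fewest digits that round-trip through float32.
func (f BFloat16) String() string {
    return strconv.FormatFloat(float64(ToFloat32(f)), 'f', -1, 32)
}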

https://go.dev/play/p/jhXQvuI9Pxz

fxamacker commented 6 months ago

Support for bfloat16 is also requested in comments at #46

x448 commented 6 months ago

bfloat16 and patents:

janpfeifer commented 3 months ago

Is creating a bfloat16 package still in your plans?

I'm trying to port the Gemma model to GoMLX, and everything seems to be bfloat16 there.

I'm using your suggested code above, since the truth is that most numeric computations happen in XLA/GPU anyway, but it would be nice to source the type from the same owner 😄

janpfeifer commented 3 months ago

As a temporary measure I created the simple github.com/gomlx/gopjrt/dtypes/bfloat16 package -- which contains what I need immediately.

The patent-trolling story around bfloat16 is frustrating ... not sure what to make of it ...