wchao1115 opened 3 years ago
@wchao1115 @huningxin do you think we should label this as "cr" for https://github.com/webmachinelearning/webnn/issues/240 purposes?
I think this is an important one and support labeling it as "cr".
@wchao1115 this issue was on the agenda today, but we had to defer due to timing. Let us know your thoughts. I'm planning to bring this up for our next meeting for discussion.
Per discussion at https://www.w3.org/2022/03/24-webmachinelearning-minutes.html#t06 we consider this to be in scope for CR.
We've discussed this feature in our recent meetings: https://www.w3.org/2022/09/22-webmachinelearning-minutes.html#t05 https://www.w3.org/2022/09/08-webmachinelearning-minutes.html#t05 https://www.w3.org/2022/08/25-webmachinelearning-minutes.html#t06
I will label this issue as "v2" because inclusion in the initial CR requires implementation experience. There's a mechanism for us to publish a Candidate Recommendation Draft subsequent to the initial CR, which would give us adequate time to properly define, develop and test this feature.
Furthermore, we should soon start discussing the WebNN "v2" plan as we look to extend our current charter, and this feature could be one concrete feature to highlight. We can continue discussing this feature on our bi-weekly calls as new information comes in and revise our position as appropriate.
It looks like this was added to the spec in 0970115398d82f01e7421057c47afdef9478543c and we may have some implementation experience at this point. Close, despite it being marked v2?
Int8 quantized models may need some extra ops, for example DynamicQuantizeLinear, DequantizeLinear, ConvInteger and MatMulInteger, which are missing from the current spec.
The Transformer Models Analysis spreadsheet has more details on the ops required by int8 quantized models (see columns marked with (int8)).
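For readers unfamiliar with these ops, the arithmetic behind DequantizeLinear and DynamicQuantizeLinear can be sketched roughly as follows. This is a simplified per-tensor uint8 sketch of the ONNX-style semantics, not the WebNN API; the function names and scalar-list representation are illustrative only:

```python
def dequantize_linear(q, scale, zero_point):
    """DequantizeLinear: x = (q - zero_point) * scale."""
    return [(v - zero_point) * scale for v in q]

def dynamic_quantize_linear(x):
    """DynamicQuantizeLinear: derive scale/zero_point from the data
    range, then quantize to uint8 [0, 255]."""
    # The range must include 0 so that 0.0 is exactly representable.
    rmin = min(min(x), 0.0)
    rmax = max(max(x), 0.0)
    scale = (rmax - rmin) / 255.0 or 1.0  # avoid division by zero
    zero_point = round(max(0.0, min(255.0, -rmin / scale)))
    q = [round(max(0.0, min(255.0, v / scale + zero_point))) for v in x]
    return q, scale, zero_point

# Round-trip example: the per-element error is bounded by scale / 2.
x = [-1.0, 0.0, 2.0, 3.0]
q, scale, zp = dynamic_quantize_linear(x)
x2 = dequantize_linear(q, scale, zp)
```

ConvInteger and MatMulInteger then perform the convolution / matmul directly on the integer tensors (accumulating in int32), which is where the actual speedup on int8-capable hardware comes from.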
@fdwr @Honry
> Int8 quantized models may need some extra ops, for example DynamicQuantizeLinear, DequantizeLinear, ConvInteger and MatMulInteger, which are missing from the current spec.
Indeed, I have those 4 prototyped here (a minimal first four): https://github.com/fdwr/chromium-src-webnn-dml/pull/1/files#diff-e1b2517a6ae8f7c4494c75d17c8650b56e4f8d430f54f5e1f765475f00a5e1f3R427-R433
Seems int4 quantization is also a thing (with negligible impact on output quality). int4 practically halves the VRAM requirement of the model and offers a speedup on devices that support it.
Example of an int4 quantized model: https://huggingface.co/01-ai/Yi-6B-Chat-4bits
Should this be considered for v2? Or is int4 too specific? (I'm not sure whether 4-bit is adequate for image or audio models.)
(There's an even more aggressive {-1, 0, 1} quantization. It's fairly new, and I believe its application is limited to language models. The BitNet paper was really cool: https://arxiv.org/abs/2310.11453)
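On the halved-VRAM point: the saving follows directly from packing two 4-bit values into each byte. A minimal sketch of unsigned int4 pack/unpack, with hypothetical helper names and no per-group scales (real int4 schemes like the one in the model above also store group-wise scales/zero points):

```python
def pack_int4(values):
    """Pack pairs of unsigned 4-bit values (0..15) into bytes,
    low nibble first. Two weights per byte -> half the memory of int8."""
    assert len(values) % 2 == 0
    return bytes((values[i] & 0xF) | ((values[i + 1] & 0xF) << 4)
                 for i in range(0, len(values), 2))

def unpack_int4(packed):
    """Inverse of pack_int4."""
    out = []
    for b in packed:
        out.append(b & 0xF)
        out.append(b >> 4)
    return out

weights = [0, 15, 7, 8]
packed = pack_int4(weights)  # 2 bytes instead of 4
```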
Supporting int8 quantized models is essential for mobile scenarios and for many NPU architectures. TensorFlow (Lite) and ONNX, for instance, have int8 quantization support built in, and WebNN should too. Related: #93