wchao1115 opened 3 years ago
@wchao1115 @huningxin do you think we should label this as "cr" for https://github.com/webmachinelearning/webnn/issues/240 purposes?
I think this is an important one and support labeling it as "cr".
@wchao1115 this issue was on the agenda today, but we had to defer due to timing. Let us know your thoughts. I'm planning to bring this up for our next meeting for discussion.
Per discussion at https://www.w3.org/2022/03/24-webmachinelearning-minutes.html#t06 we consider this to be in scope for CR.
We've discussed this feature in our recent meetings: https://www.w3.org/2022/09/22-webmachinelearning-minutes.html#t05 https://www.w3.org/2022/09/08-webmachinelearning-minutes.html#t05 https://www.w3.org/2022/08/25-webmachinelearning-minutes.html#t06
I will label this issue as "v2" because inclusion in the initial CR requires implementation experience. There's a mechanism for us to publish a Candidate Recommendation Draft subsequent to the initial CR, which would give us adequate time to properly define, develop and test this feature.
Furthermore, we should soon start discussing the WebNN "v2" plan as we look to extend our current charter, and this feature could be one concrete feature to highlight. We can continue discussing this feature on our bi-weekly calls as new information comes in and revise our position as appropriate.
It looks like this was added to the spec in 0970115398d82f01e7421057c47afdef9478543c and we may have some implementation experience at this point. Close, despite it being marked v2?
Int8 quantized models may need some extra ops, for example DynamicQuantizeLinear, DequantizeLinear, ConvInteger and MatMulInteger, which are missing from the current spec.
The Transformer Models Analysis spreadsheet has more details on the ops required by int8 quantized models (see columns marked with (int8)).
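For readers unfamiliar with these ops, the arithmetic behind DequantizeLinear and DynamicQuantizeLinear can be sketched roughly as follows. This is a simplified per-tensor uint8 sketch of the ONNX-style semantics, not the WebNN API; the function names and scalar-list representation are illustrative only:

```python
def dequantize_linear(q, scale, zero_point):
    """DequantizeLinear: x = (q - zero_point) * scale."""
    return [(v - zero_point) * scale for v in q]

def dynamic_quantize_linear(x):
    """DynamicQuantizeLinear: derive scale/zero_point from the data
    range, then quantize to uint8 [0, 255]."""
    # The range must include 0 so that 0.0 is exactly representable.
    rmin = min(min(x), 0.0)
    rmax = max(max(x), 0.0)
    scale = (rmax - rmin) / 255.0 or 1.0  # avoid division by zero
    zero_point = round(max(0.0, min(255.0, -rmin / scale)))
    q = [round(max(0.0, min(255.0, v / scale + zero_point))) for v in x]
    return q, scale, zero_point

# Round-trip example: the per-element error is bounded by scale / 2.
x = [-1.0, 0.0, 2.0, 3.0]
q, scale, zp = dynamic_quantize_linear(x)
x2 = dequantize_linear(q, scale, zp)
```

ConvInteger and MatMulInteger then perform the convolution / matmul directly on the integer tensors (accumulating in int32), which is where the actual speedup on int8-capable hardware comes from.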
@fdwr @Honry
> Int8 quantized models may need some extra ops, for example DynamicQuantizeLinear, DequantizeLinear, ConvInteger and MatMulInteger, which are missing from the current spec.
Indeed, I have those 4 prototyped here (a minimal first four): https://github.com/fdwr/chromium-src-webnn-dml/pull/1/files#diff-e1b2517a6ae8f7c4494c75d17c8650b56e4f8d430f54f5e1f765475f00a5e1f3R427-R433
Seems int4 quantization is also a thing (with negligible impact on output quality). int4 practically halves the VRAM requirement of the model and offers a speedup on devices that support it.
Example of an int4 quantized model: https://huggingface.co/01-ai/Yi-6B-Chat-4bits
Should this be considered for v2? Or is int4 too specific? (I'm not sure whether 4-bit is adequate for image or audio models.)
(There's an even more aggressive {-1, 0, 1} quantization. It's fairly new, and I believe its application is limited to language models. The BitNet paper was really cool: https://arxiv.org/abs/2310.11453)
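On the halved-VRAM point: the saving follows directly from packing two 4-bit values into each byte. A minimal sketch of unsigned int4 pack/unpack, with hypothetical helper names and no per-group scales (real int4 schemes like the one in the model above also store group-wise scales/zero points):

```python
def pack_int4(values):
    """Pack pairs of unsigned 4-bit values (0..15) into bytes,
    low nibble first. Two weights per byte -> half the memory of int8."""
    assert len(values) % 2 == 0
    return bytes((values[i] & 0xF) | ((values[i + 1] & 0xF) << 4)
                 for i in range(0, len(values), 2))

def unpack_int4(packed):
    """Inverse of pack_int4."""
    out = []
    for b in packed:
        out.append(b & 0xF)
        out.append(b >> 4)
    return out

weights = [0, 15, 7, 8]
packed = pack_int4(weights)  # 2 bytes instead of 4
```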
Supporting int8 quantized models is essential for mobile scenarios and for many NPU architectures. TensorFlow (Lite) and ONNX, for instance, have int8 quantization support built in, and WebNN should too. Related: #93