
WebNN should support int8 quantized models #128

Open
wchao1115 opened this issue Dec 14, 2020 · 13 comments

@wchao1115
Collaborator

Supporting int8 quantized models is essential for mobile scenarios and for many NPU architectures. TensorFlow (Lite) and ONNX, for instance, have int8 quantization support built in, and WebNN should too. Related #93

@anssiko
Member

anssiko commented Feb 11, 2022

@wchao1115 @huningxin do you think we should label this as "cr" for #240 purposes?

@huningxin
Contributor

I think this is an important one and I support labeling it as "cr".

@anssiko
Member

anssiko commented Feb 24, 2022

@wchao1115 this issue was on the agenda today, but we had to defer due to timing. Let us know your thoughts. I'm planning to bring this up for our next meeting for discussion.

@anssiko anssiko added the cr label Mar 24, 2022
@anssiko
Member

anssiko commented Mar 24, 2022

Per discussion at https://www.w3.org/2022/03/24-webmachinelearning-minutes.html#t06 we consider this to be in scope for CR.

@anssiko
Member

anssiko commented Sep 28, 2022

We've discussed this feature on our recent meetings:
https://www.w3.org/2022/09/22-webmachinelearning-minutes.html#t05
https://www.w3.org/2022/09/08-webmachinelearning-minutes.html#t05
https://www.w3.org/2022/08/25-webmachinelearning-minutes.html#t06

I will label this issue as "v2" due to required implementation experience for the initial CR inclusion. There's a mechanism for us to publish a Candidate Recommendation Draft subsequent to the initial CR that would give us adequate time to properly define, develop and test this feature.

Furthermore, we should soon start discussing the WebNN "v2" plan as we look to extend our current charter, and this could be one concrete feature to highlight. We can continue discussing this feature on our bi-weekly calls when there's new information and revise our position as appropriate.

@anssiko anssiko added v2 and removed cr labels Sep 28, 2022
aarongable pushed a commit to chromium/chromium that referenced this issue Dec 12, 2022
An XNNPACK Subgraph uses Values to represent the tensor data produced
and consumed by Nodes. This CL implements the methods that help define
XNNPACK Values for MLOperands. That includes the external Values for the
graph’s input and output operands, the static Values for constant
operands, and the internal Values for intermediate operands. These methods
are used by the MLGraphXnnpack::CreateXnnSubgraphAndRuntime() method, which
visits the graph’s operators in topological order and defines XNNPACK Values
for the input and output operands of each operator.

This CL initially supports defining XNNPACK Values for the float32 and
float16 MLOperandTypes. Support for quantized integer types will be
implemented as a WebNN V2 feature [1].

This CL also implements the DefineXnnpackValuesTest that covers the
definitions of different types of XNNPACK Values in various WebNN graph
topologies.

[1]: webmachinelearning/webnn#128

Bug: 1273291
Change-Id: I3e9ec7e7524705bdf436ef8bf5c07f6b072c2dae
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/3977358
Commit-Queue: ningxin hu <ningxin.hu@intel.com>
Reviewed-by: Jiewei Qian <qjw@chromium.org>
Cr-Commit-Position: refs/heads/main@{#1082138}
@inexorabletash
Member

It looks like this was added to the spec in 0970115 and we may have some implementation experience at this point. Close, despite it being marked v2?

@huningxin
Contributor

Int8 quantized models may need some extra ops, for example DynamicQuantizeLinear, DequantizeLinear, ConvInteger and MatMulInteger, that are missing from the current spec.

The Transformer Models Analysis spreadsheet has more details on the ops required by int8 quantized models (see the columns marked with (int8)).
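For reference, here is a rough TypeScript sketch of what the first of those, a dynamic uint8 quantization step, computes, loosely following the ONNX DynamicQuantizeLinear description (this is not from the spreadsheet or the spec text; the rounding and saturation details are simplified):

```ts
// Rough per-tensor sketch of dynamic uint8 quantization, loosely following
// the ONNX DynamicQuantizeLinear description; rounding/saturation simplified.
function dynamicQuantizeLinear(x: Float32Array): {
  quantized: Uint8Array; scale: number; zeroPoint: number;
} {
  // The quantization range is adjusted so it always includes 0.
  let minX = 0, maxX = 0;
  for (const v of x) {
    if (v < minX) minX = v;
    if (v > maxX) maxX = v;
  }
  const scale = (maxX - minX) / 255 || 1; // avoid divide-by-zero for all-zero input
  const zeroPoint = Math.min(255, Math.max(0, Math.round(-minX / scale)));
  const quantized = new Uint8Array(x.length);
  for (let i = 0; i < x.length; i++) {
    quantized[i] = Math.min(255, Math.max(0, Math.round(x[i] / scale) + zeroPoint));
  }
  return { quantized, scale, zeroPoint };
}
```

DequantizeLinear is the inverse affine map ((q - zeroPoint) * scale), while ConvInteger and MatMulInteger consume the quantized tensors and zero points directly.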

@fdwr @Honry

@fdwr
Collaborator

fdwr commented Feb 22, 2024

Int8 quantized models may need some extra ops, for example DynamicQuantizeLinear, DequantizeLinear, ConvInteger and MatMulInteger, that are missing from the current spec.

Indeed, I have those 4 prototyped here (a minimal first four): https://github.com/fdwr/chromium-src-webnn-dml/pull/1/files#diff-e1b2517a6ae8f7c4494c75d17c8650b56e4f8d430f54f5e1f765475f00a5e1f3R427-R433

@wacky6

wacky6 commented Mar 14, 2024

Seems int4 quantization is also a thing (with negligible impact on output quality). int4 practically halves the VRAM requirement of the model and offers a speedup on devices that support it.

Example of a int4 quantization model: https://huggingface.co/01-ai/Yi-6B-Chat-4bits

Should this be considered for v2? Or is int4 too specific? (I'm not sure if 4-bit is adequate for image or audio models)

// There's a more aggressive {-1,0,1} quantization. It's fairly new, and I believe its application is limited to language models.
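To make the storage claim concrete, here is a small sketch assuming a simple packing of two unsigned 4-bit values per byte (low nibble first) with a single per-tensor scale and zero point; real int4 formats, including the group-wise scales used by models like the one linked above, are more involved:

```ts
// Sketch only: unpack two uint4 weights per byte (low nibble first) and
// dequantize with one per-tensor scale/zeroPoint. Real int4 schemes usually
// carry per-group scales instead.
function dequantizeUint4Packed(
  packed: Uint8Array, count: number, scale: number, zeroPoint: number,
): Float32Array {
  const out = new Float32Array(count);
  for (let i = 0; i < count; i++) {
    const byte = packed[i >> 1];
    const q = (i & 1) === 0 ? (byte & 0x0f) : (byte >>> 4);
    out[i] = (q - zeroPoint) * scale; // same affine formula as int8
  }
  return out;
}

// Back-of-the-envelope: a 6B-parameter model is roughly 3 GB of weights at
// 4 bits versus roughly 6 GB at int8 (ignoring scales, activations, KV cache).
```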

@inexorabletash
Member

The BitNet paper was really cool. https://arxiv.org/abs/2310.11453

@inexorabletash
Member

Int8 quantized models may need some extra ops, for example DynamicQuantizeLinear, DequantizeLinear, ConvInteger and MatMulInteger, that are missing from the current spec.

Indeed, I have those 4 prototyped here (a minimal first four): https://github.com/fdwr/chromium-src-webnn-dml/pull/1/files#diff-e1b2517a6ae8f7c4494c75d17c8650b56e4f8d430f54f5e1f765475f00a5e1f3R427-R433

Hey @fdwr - how fresh is your prototype of these? And have you looked at how other backends (CoreML, TFLite) would implement these? Starting with the "minimum viable" quantization support as outlined in #623 is appealing!

@fdwr
Collaborator

fdwr commented Aug 2, 2024

@inexorabletash

how fresh is your prototype

It's moldy bread by now (but snippets could be reused). The ORT WebNN EP implementation still exists (it was originally added during prototyping) and would light up again once the op is added into Chromium.

And have you looked at how other backends (CoreML, TFLite)

There are differences, but they should be expressible (🤞). For dequantization, most decompose to output = mul(sub(input, zeroPoint), scale) (except TF full, CoreML MIL's LUT mode, and CoreML's scale&bias form). They have differing broadcasting rules, which I'd like to iron out to be more consistent (consistent with unidirectional broadcasting of its decomposition and expand).

| API | Name | Equation | Types |
| --- | --- | --- | --- |
| TFLite | DequantizeOp | real = (quantized - zeroPoint) * scale (link) | input: uint4, uint8, int8, int16, float16<br>zeroPoint: uint8<br>scale: float32<br>output: float32 |
| TF | tf.quantization.dequantize | output = minRange + (input * (maxRange - minRange) / dataTypeRange) | input: uint8<br>minRange: float32<br>maxRange: float32<br>dataTypeRange: int<br>output: float32 |
| CoreML MIL | constexpr_affine_dequantize | real = (input - zeroPoint) * scale | input: uint8, int8<br>zeroPoint: uint8, int8, float32<br>scale: same as output<br>output: float16, float32 |
| CoreML MIL | constexpr_lut_to_dense | real = lut[input] | input: uint1, uint2, uint4, uint6, uint8<br>output: uint8, int8, float16, float32 |
| CoreML | LinearQuantizationParams | ? input * scale + bias ? | input: ?<br>scale: float32<br>bias: float32<br>output: ? |
| ONNX | DequantizeLinear | real = (input - zeroPoint) * scale | input: uint4, int4, uint8, int8, uint16, int16, int32, float8e4m3fn, float8e4m3fnuz, float8e5m2, float8e5m2fnuz<br>zeroPoint: same as input<br>scale: same as output<br>output: bfloat16, float16, float32 |
| DML | DEQUANTIZE_LINEAR | real = (input - zeroPoint) * scale | input: uint4, int4, uint8, int8, uint16, int16, uint32, int32<br>zeroPoint: same as input<br>scale: same as output<br>output: float16, float32 |
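Most of the rows above reduce to the same affine decomposition mentioned earlier. As a minimal per-tensor sketch, it could be emulated today with element-wise builder ops (cast, sub, mul); the {dataType, dimensions} descriptor shape, the scalar-constant broadcasting, and the availability of these ops in any particular backend are assumptions here rather than anything settled in this thread, and per-axis or blockwise scales are deliberately left out:

```ts
// Hypothetical per-tensor emulation of the common decomposition:
//   real = (cast(input, float32) - zeroPoint) * scale
// Assumes MLGraphBuilder.cast/sub/mul, scalar constants that broadcast against
// the quantized tensor, and a {dataType, dimensions} operand descriptor
// (WebNN type declarations are taken as given and not included here).
function buildDequantize(
  builder: MLGraphBuilder,
  quantized: MLOperand,   // e.g. an int8 constant or graph input
  scale: number,
  zeroPoint: number,
): MLOperand {
  const scalar = (value: number) =>
    builder.constant({ dataType: 'float32', dimensions: [] },
                     new Float32Array([value]));
  const asFloat = builder.cast(quantized, 'float32');
  return builder.mul(builder.sub(asFloat, scalar(zeroPoint)), scalar(scale));
}
```

A dedicated dequantizeLinear op would still have to pin down the broadcasting rules for non-scalar scale and zeroPoint tensors, which is where the backends in the table diverge most.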

@reillyeon
Contributor

Discussed at the TPAC 2024 F2F. Group consensus was to implement QDQ operators for int8 and int4. Deduplicating with #93.
