webmachinelearning / webnn-polyfill

🧠⚙️ Web Neural Network API polyfill based on TensorFlow.js
https://www.npmjs.com/package/@webmachinelearning/webnn-polyfill
Apache License 2.0

Max ULP distance for WebNN ops with pr #139 - ULP comparison #144

Open BruceDai opened 2 years ago

BruceDai commented 2 years ago

Since the CPU backend uses double precision, I collected the max ULP distance for WebNN ops on three devices with PR #139 (ULP comparison), using the result of the CPU backend (tfjs-backend-cpu based) as the baseline.

Here are some observations:

  1. The max distance on the Wasm backend is the same across devices for each op.
  2. The max distance on the WebGL backend varies across devices for some ops, e.g., cos, sin, etc.
  3. The max distance is stable on both the Wasm and WebGL backends for some ops, such as concat, gemm, etc.
  4. For ops that have an activation option, such as conv2d and batchNormalization, the max distance is affected by the fused activation op.
  5. The output of relu for a negative input on Devices 1 & 2 with the WebGL backend is -0.0, whose distance from the baseline 0.0 is 2147483648.

Open questions:

  1. For some ops, the ULP distance varies with the inputs/parameters and the device. How do we decide an acceptable ULP distance for them?
  2. When executing the relu op with negative inputs, some devices compute -0.0 while the expected output on the CPU backend is 0.0; the ULP distance between -0.0 and the baseline 0.0 is 2147483648, while with non-negative inputs the max distance is 0. How do we decide an acceptable ULP distance for the relu op? (The sketch below shows why the distance comes out as 2147483648.)
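For reference, 2147483648 is exactly 2^31, which is what you get if the ULP distance is taken over the raw float32 bit patterns, since +0.0 and -0.0 differ only in the sign bit. This is my assumption about how the PR #139 comparison behaves; the helper below is only an illustration, not the polyfill's API.

const bitsOf = (x) => new Uint32Array(new Float32Array([x]).buffer)[0];
console.log(bitsOf(0.0).toString(16));    // "0"
console.log(bitsOf(-0.0).toString(16));   // "80000000"
console.log(bitsOf(-0.0) - bitsOf(0.0));  // 2147483648, i.e. 2^31

Treating -0.0 and +0.0 as equal before taking the bit-pattern difference (note that -0.0 === 0.0 is true in JavaScript) would make the relu distance 0 for these devices.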

Distance details:

| Op | Wasm Device1 | WebGL Device1 | Wasm Device2 | WebGL Device2 | Wasm Device3 | WebGL Device3 |
| --- | --- | --- | --- | --- | --- | --- |
| abs | 0 | 0 | 0 | 0 | 0 | 0 |
| ceil | 0 | 0 | 0 | 0 | 0 | 0 |
| cos | 3 | 590 | 3 | 4 | 3 | 4 |
| exp | 1 | 0 | 1 | 0 | 1 | 0 |
| floor | 0 | 0 | 0 | 0 | 0 | 0 |
| log | 1 | 0 | 1 | 0 | 1 | 0 |
| neg | 0 | 0 | 0 | 0 | 0 | 0 |
| sin | 0 | 670 | 0 | 14 | 0 | 4 |
| tan | 4 | 284 | 4 | 14 | 4 | 5 |
| add | 0 | 0 | 0 | 0 | 0 | 0 |
| sub | 0 | 0 | 0 | 0 | 0 | 0 |
| mul | 0 | 0 | 0 | 0 | 0 | 0 |
| div | 0 | 1 | 0 | 1 | 0 | 1 |
| max | 0 | 0 | 0 | 0 | 0 | 0 |
| min | 0 | 0 | 0 | 0 | 0 | 0 |
| pow (exponent 0.5) | 0 | 2 | 0 | 2 | 0 | 4 |
| pow (exponent 30) | 1 | 73 | 1 | 73 | 1 | 73 |
| pow (exponent 50) | 1 | 51 | 1 | 51 | 1 | 71 |
| batchNormalization | 289 | 314 | 289 | 314 | 289 | 275 |
| clamp | 0 | 0 | 0 | 0 | 0 | 0 |
| conv2d | 1 | 1 | 1 | 0 | 1 | 2 |
| conv2d (fused sigmoid) | 4320708* | 0 | 4320708* | 0 | 4320708* | 0 |
| sigmoid | 4320708* | 0 | 4320708* | 0 | 4320708* | 0 |
| relu | 0 | 2147483648# | 0 | 2147483648# | 0 | 0 |
| matmul | 28 | 14 | 28 | 18 | 28 | 28 |
| hardSwish | 0 | 0 | 0 | 0 | 0 | 1 |
| averagepool2d | 2 | 2 | 2 | 2 | 2 | 1 |
| l2pool2d | 2 | 2 | 2 | 2 | 2 | 0 |
| maxpool2d | 0 | 0 | 0 | 0 | 0 | 0 |
| gemm | 1 | 1 | 1 | 1 | 1 | 1 |
| gruCell | 4 | 15 | 4 | 15 | 4 | 15 |
| gru | 4 | 9 | 4 | 9 | 4 | 17 |
| instanceNormalization | 128 | 128 | 128 | 128 | 128 | 128 |
| leakyRelu | 1 | 1 | 1 | 1 | 1 | 1 |
| pad | 0 | 0 | 0 | 0 | 0 | 0 |
| reduceL1 | 0 | 1 | 0 | 1 | 0 | 1 |
| reduceL2 | 0 | 1 | 0 | 1 | 0 | 2 |
| reduceLogSumExp | 0 | 0 | 0 | 0 | 0 | 0 |
| reduceMax | 0 | 0 | 0 | 0 | 0 | 0 |
| reduceMean | 0 | 0 | 0 | 0 | 0 | 0 |
| reduceMin | 0 | 0 | 0 | 0 | 0 | 0 |
| reduceProduct | 0 | 0 | 0 | 0 | 0 | 0 |
| reduceSum | 0 | 0 | 0 | 0 | 0 | 0 |
| tanh | 1 | 11 | 1 | 11 | 1 | 1 |
| concat | 0 | 0 | 0 | 0 | 0 | 0 |
| reshape | 0 | 0 | 0 | 0 | 0 | 0 |
| resample2d | 0 | 0 | 0 | 0 | 0 | 0 |
| slice | 0 | 0 | 0 | 0 | 0 | 0 |
| split | 0 | 0 | 0 | 0 | 0 | 0 |
| squeeze | 0 | 0 | 0 | 0 | 0 | 0 |
| transpose | 0 | 0 | 0 | 0 | 0 | 0 |
Note:
  * distance 4320708 is the distance between the actual output 6.054601485195952e-39 and the baseline 0.0
  # distance 2147483648 is the distance between the actual output -0.0 and the baseline 0.0 for negative inputs
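Both footnoted values can be reproduced by inspecting the raw float32 bit patterns, again assuming the distance is the absolute difference of those patterns: 6.054601485195952e-39 is a subnormal float32, so its bit pattern equals its ULP distance from 0.0.

const bitsOf = (x) => new Uint32Array(new Float32Array([x]).buffer)[0];
console.log(bitsOf(6.054601485195952e-39));  // 4320708, per note * above
console.log(bitsOf(-0.0));                   // 2147483648, per note # above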
BruceDai commented 2 years ago

@huningxin PTAL, thanks.

huningxin commented 2 years ago

This is great input for this week's WG discussion on conformance testing. Thanks much @BruceDai .

/cc @anssiko @wchao1115 @dontcallmedom

wchao1115 commented 2 years ago

@BruceDai Can you please explain what you mean by "max ULP distance" and how you plan to use it?

Are we looking to use the result from the WebNN-native on CPU as our baseline?

huningxin commented 2 years ago

Are we looking to use the result from the WebNN-native on CPU as our baseline?

Bruce is using the result of the WebNN-polyfill CPU backend as the baseline. The WebNN-polyfill CPU backend is based on the TF.js CPU backend, which uses JavaScript numbers to calculate its kernels. I suppose the results should have double precision. /cc @pyu10055

Regarding the current WebNN-native CPU backends, say OpenVINO CPU, XNNPACK and oneDNN, I understand they are single precision and might not meet the baseline requirement.

BruceDai commented 2 years ago

@wchao1115 The max ULP distance is the maximum of the per-element ULP distances between the actual output and the baseline.

Here's a sample of the pow op tested on the WebGL backend.

// Use random data as input.
const input = [0.33435354, 0.57139647, 0.03689031];
const exponent = 30;

// Use the result of the CPU backend (tfjs-backend-cpu based) as the baseline.
const baseline = [
  5.323259448666113e-15,
  5.106538125687621e-8,
  1.0229478789571165e-43];

// Actual output = pow(input, 30) computed on the WebGL backend.
const actualOutput = [
  5.323248437237799e-15,
  5.1065363493307814e-8,
  0.0];

ULP distances between the actual output and the baseline:
   ULP distance between 5.323248437237799e-15 and 5.323259448666113e-15 is 26
   ULP distance between 5.1065363493307814e-8 and 5.106538125687621e-8 is 5
   ULP distance between 0.0 and 1.0229478789571165e-43 is 73

Among these three ULP distances (26, 5, 73), the current max ULP distance is 73.
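A minimal sketch of this calculation, assuming (as in my reading of PR #139) that the ULP distance is the absolute difference of the raw float32 bit patterns; ulpDistance and maxUlpDistance are illustrative names, not the polyfill's API:

function ulpDistance(a, b) {
  // Round both values to float32 and compare their raw bit patterns.
  const bits = new Uint32Array(new Float32Array([a, b]).buffer);
  return Math.abs(bits[0] - bits[1]);
}

function maxUlpDistance(actual, baseline) {
  return Math.max(...actual.map((x, i) => ulpDistance(x, baseline[i])));
}

// Using `actualOutput` and `baseline` from the pow sample above:
console.log(actualOutput.map((x, i) => ulpDistance(x, baseline[i])));  // [26, 5, 73] per the distances above
console.log(maxUlpDistance(actualOutput, baseline));                   // 73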

There's a problem that the max ULP distance would change with different (random) input data.

What's the strategy for defining an acceptable ULP distance?

wchao1115 commented 2 years ago

@BruceDai Your ULP values seem high. For reference, the pow operator in the DirectML GPU conformance test has a ULP tolerance of 0 (exact) for single-precision compare, 2 for half-precision (float16) via single-precision compute compare, and 4 for half-precision via half-precision compute compare.

@huningxin Are you sure that the baseline result here is from a pure double-precision compute on the CPU? If the baseline is indeed from a double precision result, then you will need to truncate it down to a single-precision value before comparing it with the single-precision result from the WebGL backend. The 2 inputs to the CompareUlp function must be of the same type.
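In JavaScript, that truncation step could be as simple as the sketch below (Math.fround rounds a double to the nearest float32 value; storing into a Float32Array has the same effect). The values reuse Bruce's pow example.

const baselineDouble = Math.pow(0.33435354, 30);     // computed in double precision
const baselineSingle = Math.fround(baselineDouble);  // rounded to single precision
// Compare baselineSingle (not baselineDouble) against the backend's float32 output.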

wchao1115 commented 2 years ago

I'll be happy to add a DirectML column with our ULP values to your table above if it helps. Note that you'll need at least 2 tables, one for float32 and another for float16 results. DirectML further differentiates float16 results into two modes -- the result on a float16 tensor from a float32 calculation, and the result on a float16 tensor from a float16 calculation. (We actually break the latter category down further into float16 multiplication with float16 accumulation vs. float16 multiplication with float32 accumulation, but let's leave that detail for now.)

huningxin commented 2 years ago

Are you sure that the baseline result here is from a pure double-precision compute on the CPU?

I believe so, because AFAIK JavaScript performs arithmetic calculations in double precision, and the tfjs-backend-cpu kernels are implemented in JavaScript.

If the baseline is indeed from a double precision result, then you will need to truncate it down to a single-precision value before comparing it with the single-precision result from the WebGL backend.

I suppose this is also handled, because the double-precision results are stored back into a Float32Array before being compared with other single-precision results, e.g., from the WebGL backend.

huningxin commented 2 years ago

We could probably compute the baseline in JavaScript along with the test cases (as part of WPT).

For Bruce's pow example, the baseline could be computed simply with Math.pow. A code sketch:

const input = [0.33435354, 0.57139647, 0.03689031];
const exponent = 30;

// Compute the double-precision baseline.
const baseline = input.map(x => Math.pow(x, exponent));
// baseline = [5.323261130422279e-15, 5.1065382759817323e-8, 1.0171128528373136e-43]

// Truncate the double-precision baseline to single precision.
const baselineInFloat32 = new Float32Array(baseline);
// baselineInFloat32 = [5.323261142732008e-15, 5.106538125687621e-8, 1.0229478789571165e-43]

// Then do the ULP comparison with the results of WebNN pow.

This is an extremely simplified example. The baselines of more complex ops would require more effort to implement the compute kernels; as a reference, the tf.js conv2d JS kernel is ~150 LOC (a naive sketch is included below). The effort might be worth it, because this could help us establish a baseline that meets the requirements raised in the WebML WG Teleconference – 2 Dec 2021, like

by @wchao1115:

  - all computation is done in double precision
  - we don't want any intermediate casting
  - an open source ref, anyone can look at the code and be confident

by @dontcallmedom:

  - the codebase should be easy to review, not too many layers of abstraction

Any thoughts?
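To give a sense of the kernel effort for a complex op, a naive double-precision conv2d reference could look roughly like the sketch below. This is my own illustration under simplifying assumptions (NHWC input, HWIO filter, 'valid' padding, unit dilation, no groups), not the tf.js kernel.

function conv2dBaseline(input, inputShape, filter, filterShape, strides = [1, 1]) {
  const [batches, inHeight, inWidth, inChannels] = inputShape;
  const [filterHeight, filterWidth, , outChannels] = filterShape;
  const [strideH, strideW] = strides;
  const outHeight = Math.floor((inHeight - filterHeight) / strideH) + 1;
  const outWidth = Math.floor((inWidth - filterWidth) / strideW) + 1;
  // A plain Array keeps the accumulation in double precision.
  const output = new Array(batches * outHeight * outWidth * outChannels).fill(0);
  for (let b = 0; b < batches; ++b) {
    for (let oy = 0; oy < outHeight; ++oy) {
      for (let ox = 0; ox < outWidth; ++ox) {
        for (let oc = 0; oc < outChannels; ++oc) {
          let sum = 0;
          for (let ky = 0; ky < filterHeight; ++ky) {
            for (let kx = 0; kx < filterWidth; ++kx) {
              for (let ic = 0; ic < inChannels; ++ic) {
                const iy = oy * strideH + ky;
                const ix = ox * strideW + kx;
                const inputValue = input[((b * inHeight + iy) * inWidth + ix) * inChannels + ic];
                const filterValue = filter[((ky * filterWidth + kx) * inChannels + ic) * outChannels + oc];
                sum += inputValue * filterValue;
              }
            }
          }
          output[((b * outHeight + oy) * outWidth + ox) * outChannels + oc] = sum;
        }
      }
    }
  }
  // Round to float32 (e.g., new Float32Array(output)) before the ULP comparison.
  return output;
}

Handling padding modes, dilations, groups, other layouts and fused activations is what pushes a real reference kernel toward the ~150 LOC mentioned above.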

dontcallmedom commented 2 years ago

this would match what I have in mind, indeed.

FWIW, while maintaining it in WPT is an option, I don't think we need to make this a requirement - at the end of the day, what is needed in WPT is only the results of the computation, not the computation code itself.

In particular, given the amount of WPT-specific infrastructure in that repo, we might be better served by a lighter-weight dedicated repo to build and audit the baseline.

BruceDai commented 2 years ago

I developed an experimental double-precision baseline implementation of the element-wise binary ops, referring to the https://github.com/tensorflow/tfjs code.

Here are result screenshots of running the float32 binary tests against WebNN-native backends (DML-GPU and OpenVINO-CPU) under a criterion of 2 ULP distance for each element-wise binary op.

There are 22 binary tests: 16 pass and 6 fail on the DML backend, and 17 pass and 5 fail on the OpenVINO backend. I observed that the output behavior of the pow op is not the same on these two backends.

DML-GPU: figure-1, testing by DirectML backend (GPU)

OpenVINO-CPU: figure-2, testing by OpenVINO backend (CPU)
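As a rough sketch of the kind of check this runs (names and structure are illustrative; the actual experimental code may differ), assuming the baseline is computed with plain JavaScript double-precision arithmetic and compared with a raw-bit ULP distance under a 2-ULP tolerance:

const binaryOps = {
  add: (a, b) => a + b,
  sub: (a, b) => a - b,
  mul: (a, b) => a * b,
  div: (a, b) => a / b,
  max: (a, b) => Math.max(a, b),
  min: (a, b) => Math.min(a, b),
  pow: (a, b) => Math.pow(a, b),
};

function checkBinaryOp(opName, inputA, inputB, actualOutput, ulpTolerance = 2) {
  // Compute the baseline in double precision, then round it to single precision.
  const baseline = new Float32Array(inputA.map((a, i) => binaryOps[opName](a, inputB[i])));
  // Raw-bit ULP distance per element, as in the earlier sketches.
  const bitsOf = (x) => new Uint32Array(new Float32Array([x]).buffer)[0];
  const distances = Array.from(baseline, (b, i) => Math.abs(bitsOf(b) - bitsOf(actualOutput[i])));
  return distances.every((d) => d <= ulpTolerance);
}

// Example (hypothetical variable): checkBinaryOp('pow', [0.33435354], [30], webnnPowOutput),
// where webnnPowOutput is the Float32Array produced by the WebNN backend under test.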

@wchao1115 @huningxin @dontcallmedom PTAL, thanks.