Closed BruceDai closed 1 year ago
@BruceDai, thanks for your contributions to conformance testing. I added webnn-baseline to today's agenda including discussion on ULP tolerances to unblock your work on this (I'm not expecting presentation, just discussion). The webnn-baseline is identified as a CR requirement, so high priority.
@wchao1115 @huningxin your feedback is welcome in this issue to unblock this proposed work. Since we have a busy agenda today, we may need to defer to GH discussion.
I'm sorry to report status late. Testing ULP tolerances between the actual output of WebNN operations and the expected data/baseline from WebNN-Baseline on several different HW devices with the WebNN-Native DML and OpenVINO backends, we observed that ULP distances are small for the majority of cases with normal input data, with some large ULP distances for certain special input data. I'd like to propose the following majority-case ULP tolerances to the WG.
@wchao1115 Please also take a look, and I hope you can share your previous ULP tolerances for DML operations, thanks.
Op | Propose ULP Tolerance |
---|---|
batchNormalization | 5 |
clamp | 0 |
concat | 0 |
conv2d | 2 |
add | 1 |
sub | 1 |
mul | 1 |
div | 2 |
max | 0 |
min | 0 |
pow | 3 |
abs | 0 |
ceil | 0 |
cos | 2 |
exp | 2 |
floor | 0 |
log | 3 |
neg | 0 |
sin | 2 |
tan | 4 |
gemm | 1 |
leakyRelu | 1 |
matmul | 1 |
averagepool2d | 2 |
maxpool2d | 0 |
relu | 0 |
reduceMax | 0 |
reduceMean | 0 |
reduceMin | 0 |
reduceProduct | 0 |
reduceSum | 0 |
reshape | 0 |
sigmoid | 2 |
slice | 0 |
softmax | 1 |
split | 0 |
squeeze | 0 |
tanh | 2 |
transpose | 0 |
I've first submitted a PR https://github.com/web-platform-tests/wpt/pull/34287 adding tests for 8 operations (clamp / concat / relu / reshape / slice / split / squeeze / transpose) which have 0 ULP distance between actual output and expected data/baseline.
As this is related to wpt which is cr blocker #240, I propose to label this issue with "cr". @anssiko
[Piggy-backing on this issue with a more generic w-p-t question.]
@BruceDai, could you give us an update on where we are in terms of test coverage for WebNN API w-p-t tests?
Our plan is to migrate the mocha tests to wpt/webnn to satisfy CR readiness criteria tracked in https://github.com/webmachinelearning/webnn/issues/240.
Looking at the relevant wpt PRs it looks like the migration is in progress.
Do you foresee other blockers besides ULP tolerances discussed in this issue? Thanks for your contributions to w-p-t!
Hi @anssiko, sorry for late response due to the holidays.
The current WebNN API spec defines 56 operations. WebNN-Baseline has already implemented the 42 first-wave ops, and WebNN-Polyfill has implemented most of them (50/56, including the 42 first-wave ops). I'm starting to add operation-level tests beyond the 8/42 first-wave ops listed above. Here's a table of implemented tests, please have a look, thanks.
Operations \ tests | WebNN-Baseline | WebNN-Polyfill | WPT | Note (Is first wave operation?) |
---|---|---|---|---|
batchNormalization | ✓ | ✓ | ✗ | Yes |
clamp | ✓ | ✓ | ✓ | Yes |
concat | ✓ | ✓ | ✓ | Yes |
conv2d | ✓ | ✓ | ✗ | Yes |
convTranspose2d | ✓(*) | ✓ | ✗ | Yes |
add | ✓ | ✓ | ✗ | Yes |
sub | ✓ | ✓ | ✗ | Yes |
mul | ✓ | ✓ | ✗ | Yes |
div | ✓ | ✓ | ✗ | Yes |
max | ✓ | ✓ | ✗ | Yes |
min | ✓ | ✓ | ✗ | Yes |
pow | ✓ | ✓ | ✗ | Yes |
abs | ✓ | ✓ | ✗ | Yes |
ceil | ✓ | ✓ | ✗ | Yes |
cos | ✓ | ✓ | ✗ | Yes |
exp | ✓ | ✓ | ✗ | Yes |
floor | ✓ | ✓ | ✗ | Yes |
log | ✓ | ✓ | ✗ | Yes |
neg | ✓ | ✓ | ✗ | Yes |
sin | ✓ | ✓ | ✗ | Yes |
tan | ✓ | ✓ | ✗ | Yes |
gemm | ✓ | ✓ | ✗ | Yes |
gru | ✓ | ✓ | ✗ | Yes |
gruCell | ✓ | ✓ | ✗ | Yes |
hardSigmoid | ✗ | ✗ | ✗ | No |
hardSwish | ✗ | ✓ | ✗ | No |
instanceNormalization | ✗ | ✓ | ✗ | No |
leakyRelu | ✓ | ✓ | ✗ | Yes |
matmul | ✓ | ✓ | ✗ | Yes |
linear | ✗ | ✗ | ✗ | No |
pad | ✗ | ✓ | ✗ | No |
averagepool2d | ✓ | ✓ | ✗ | Yes |
maxpool2d | ✓ | ✓ | ✗ | Yes |
l2Pool2d | ✗ | ✓ | ✗ | No |
reduceL1 | ✗ | ✓ | ✗ | No |
reduceL2 | ✗ | ✓ | ✗ | No |
reduceLogSum | ✗ | ✗ | ✗ | No |
reduceLogSumExp | ✗ | ✓ | ✗ | No |
reduceMax | ✓ | ✓ | ✗ | Yes |
reduceMean | ✓ | ✓ | ✗ | Yes |
reduceMin | ✓ | ✓ | ✗ | Yes |
reduceProduct | ✓ | ✓ | ✗ | Yes |
reduceSum | ✓ | ✓ | ✗ | Yes |
reduceSumSquare | ✗ | ✗ | ✗ | No |
relu | ✓ | ✓ | ✓ | Yes |
resample2d | ✗ | ✓ | ✗ | No |
reshape | ✓ | ✓ | ✓ | Yes |
sigmoid | ✓ | ✓ | ✗ | Yes |
slice | ✓ | ✓ | ✓ | Yes |
softmax | ✓ | ✓ | ✗ | Yes |
softplus | ✗ | ✗ | ✗ | No |
softsign | ✗ | ✗ | ✗ | No |
split | ✓ | ✓ | ✓ | Yes |
squeeze | ✓ | ✓ | ✓ | Yes |
tanh | ✓ | ✓ | ✗ | Yes |
transpose | ✓ | ✓ | ✓ | Yes |
Note:
- ✓ in column WPT means that we've already added tests for this operation to WPT with a submitted PR.
- ✗ in column WPT together with Yes in the last column means that we've locally migrated this first-wave op's tests from WebNN-Polyfill to the WPT WebNN tests, with submission pending on the ULP tolerance decision.
- (*) `convTranspose2d` was split from `conv2d`, so WebNN-Baseline can implicitly support `convTranspose2d` by invoking `conv2d` with some options. I'll submit a PR adding a `convTranspose2d` implementation and updating the relevant tests so that WebNN-Baseline clearly supports `convTranspose2d`.
In my opinion, there isn't any other blocker except the ULP tolerances, which we're working on.
I plan to first add the first-wave operation tests to the WPT project, then add tests for the other operations that are still being implemented in WebNN-Polyfill and WebNN-Baseline. Any suggestions are welcome, thanks.
@BruceDai thank you for this update, your plan sounds good to me. Your wpt contributions play an important role in the CR readiness. Please bring any further blockers to the attention of the WG so we can help you address them in a timely manner.
@BruceDai I'll make this a meta issue for WPT tests tracking and rename the issue to reflect that.
Please link the relevant issues and PRs into this meta issue to keep the WG informed of the progress (not everyone is watching the huge wpt repo). We'll review your test plan https://github.com/webmachinelearning/webnn/issues/265#issuecomment-1246622380 on our upcoming call. Thank you!
@BruceDai We're close to producing an initial list of recommended ULP tolerances for the ops you're listing here. There will be some more explanation as to why we recommend a certain tolerance value for certain ops in the list.
+= @fdwr.
Hi BruceDai, here's the initial list...

For the `linear` operator, you can still randomly generate the `input`, `scale`, and `bias` parameters, but ensure scale and bias have consistent signs (both positive or both negative, or else subtraction of nearly equal numbers will eventually bite you in some random permutation). For tangent, avoid querying too close to the repeating asymptotes of 1/4π and 3/4π.

Several operators reduce many input elements per output element (see "`IEPOE`" below), whether it's along a reduction axis like reduceSum and gemm, or a sliding window like conv and averagePool, and so the upper limit for error depends on the parameters, not just a single hard-coded tolerance value. Beware you might witness a very low error running some of these operators and think the precision of the underlying computation is very good, but this is a lie, a false comfort due to round-to-nearest-even's wonderful tendency to balance out error. You could sum 100 random numbers and get an actual value only a few ULP off from the expected value in the common case, but then you will eventually encounter some outliers that are pretty far off, because the error variance is still wider, and the worst case is broader (broader than, say, summing 10 numbers). Expectedly, the number of lossy math operations also contributes, not just the number of inputs, and the values below are not as tight as they could be in practice, but it's about setting a reasonable upper limit.

Some functions' error is centered within a range around the expected value, bounded by an absolute tolerance (`ATOL`), while others' varies proportional to the magnitude of the signal, bounded by a percentage/relative tolerance (`RTOL`) of the expected value. Similarly, graphing the error of software math functions will in some cases show error centered some range around the expected value (like with sine and cos, which are often implemented via lookup tables with linear interpolation) and in other cases show error proportional to the magnitude of the input (like with convolution and multiplication). In computers, rather than use relative percentages (RTOL), we can instead use the bitwise delta between values to measure the unit in the last place (`ULP`, which you are already familiar with). For ATOL, it's just `actual <= expected + atol && actual >= expected - atol`.

Op | Old Proposed ULP Tolerance | float16 | float32 | notes |
---|---|---|---|---|
batchNormalization | 5 | 6 ULP | 6 ULP | (a - mean) * scale / sqrt(variance + epsilon) + bias |
clamp | 0 | 0 | 0 | if a > high then high elif a < low then low else a |
concat | 0 | 0 | 0 | |
conv2d | 2 | IEPOE*2 ULP | IEPOE*2 ULP | number of reduced input elements multiplied by filter and summed (a sliding dot product like pooling). So `(Filter.Sizes.W * Filter.Sizes.H * (Input.Sizes.C / GroupCount)) * 2`. // FilterSize.D too if 3D |
add | 1 | 1 ULP | 1 ULP | |
sub | 1 | 1 ULP | 1 ULP | |
mul | 1 | 1 ULP | 1 ULP | |
div | 2 | 2 ULP | 2 ULP | implementations may instead use x * (1/y), and so 1 for reciprocal and 1 for multiply |
max | 0 | 0 | 0 | |
min | 0 | 0 | 0 | |
pow | 3 | 2 ULP | 32 ULP | May expand to `exp(b * log(a))`. |
abs | 0 | 0 | 0 | |
ceil | 0 | 0 | 0 | |
cos | 2 | 1/512 ATOL or 1 ULP | 1/1024 ATOL | |
exp | 2 | 1 ULP | 32 ULP | ULP is typically very small (0 to 2), but negative values can yield larger deltas (e.g. exp(-36.7462921143) yields ULP ±27 on my machine). float16 is actually computed using float32 (so 1 ULP for final roundoff). |
floor | 0 | 0 | 0 | |
log | 3 | 1/1024 ATOL or 2 ULP | 1/1024 ATOL or 2 ULP | |
neg | 0 | 0 | 0 | |
sin | 2 | 1/512 ATOL or 1 ULP | 1/1024 ATOL | a little looser than GPU specs |
tan | 4 | 1/512 ATOL or 1 ULP | 1/1024 ATOL | |
gemm | 1 | IEPOE*2+3 ULP | IEPOE*2+3 ULP | `(dot(a[i, …], b[…, j]) * alpha) + (beta * C)`. If there is no optional C input and alpha/beta are identity, use the matmul tolerance |
leakyRelu | 1 | 1 ULP | 1 ULP | if a >= 0 then a else a * alpha |
matmul | 1 | IEPOE*2 ULP | IEPOE*2 ULP | `dot(a[i, …], b[…, j])` |
averagepool2d | 2 | IEPOE+2 ULP | IEPOE+2 ULP | number of reduced element additions and a final division |
maxpool2d | 0 | 0 | 0 | |
relu | 0 | 0 | 0 | max(a, 0) |
reduceMax | 0 | 0 | 0 | |
reduceMean | 0 | IEPOE+2 ULP | IEPOE+2 ULP | number of reduced element additions and a final division |
reduceMin | 0 | 0 | 0 | |
reduceProduct | 0 | IEPOE ULP | IEPOE ULP | number of reduced multiplications |
reduceSum | 0 | IEPOE ULP | IEPOE ULP | number of reduced additions |
reshape | 0 | 0 | 0 | |
sigmoid | 2 | 3 | 32+2 | `1 / (1 + exp(-a))`; float16's exp is done as float32 (leaving a few ULP for roundoff) |
slice | 0 | 0 | 0 | |
softmax | 1 | IEPOE*3+3 ULP | IEPOE*3+3 ULP | `exp(a - reducemax(A, axes)) / reducesum(exp(A - reducemax(A, axes)), axis)`; // equivalent to `exp(a) / sum(exp(A))` |
split | 0 | 0 | 0 | |
squeeze | 0 | 0 | 0 | |
tan | na | 1/512 ATOL or 1 ULP | 1/1024 ATOL | may expand to sin(radians) / cos(radians) |
tanh | 2 | 1/512 ATOL or 1 ULP | 1/1024 ATOL | |
transpose | 0 | 0 | 0 |
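For the IEPOE-scaled rows in the table above, the per-output reduction size follows from the operator parameters. Here is a minimal sketch under my own reading of the table's conv2d note (the function names are illustrative, not from any spec or repo):

```javascript
// IEPOE: input elements per output element, i.e. how many input values are
// reduced into a single output value. The tolerance rows above scale with it.

// conv2d: each output element is a sliding dot product over
// filterH * filterW * (inputChannels / groupCount) input elements.
function conv2dIEPOE(filterHeight, filterWidth, inputChannels, groupCount) {
  return filterHeight * filterWidth * (inputChannels / groupCount);
}

// Reduction ops (reduceSum, reduceProduct, ...): the product of the
// sizes of the reduced axes.
function reduceIEPOE(inputShape, axes) {
  return axes.reduce((count, axis) => count * inputShape[axis], 1);
}

// Example: a 3x3 filter over 16 input channels with no grouping gives
// IEPOE = 3 * 3 * 16 = 144, so the float32 conv2d row (IEPOE*2 ULP)
// would allow 288 ULP for that configuration.
const iepoe = conv2dIEPOE(3, 3, 16, 1);
const conv2dToleranceUlp = iepoe * 2;
```

The point of parameterizing the tolerance this way is that a 7x7 convolution over 512 channels legitimately accumulates far more rounding error than a 1x1 over 3 channels, so no single constant fits both.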
- ATOL: `expected` within `[actual - atol, actual + atol]`
- RTOL: `expected` within `[actual * (1 - rtol), actual * (1 + rtol)]`
- ULP: `expected.asRawBits` within `[actual.asRawBits - ulp, actual.asRawBits + ulp]`

Let me know if you have any questions.
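A minimal float32 sketch of these two checks (the function names are illustrative; real WPT harness code may differ):

```javascript
// Map a float32's bit pattern onto a monotonic integer scale, so that
// adjacent representable floats map to adjacent integers. Negative floats
// are sign-magnitude in IEEE 754, so they need remapping below zero.
function float32ToOrderedBits(value) {
  const f32 = new Float32Array(1);
  const i32 = new Int32Array(f32.buffer); // shares the same 4 bytes
  f32[0] = value;
  const bits = i32[0];
  return bits < 0 ? -2147483648 - bits : bits;
}

// ULP distance: how many representable float32 values separate a and b.
function ulpDistance32(a, b) {
  return Math.abs(float32ToOrderedBits(a) - float32ToOrderedBits(b));
}

// ATOL check, exactly as written above.
function withinAtol(actual, expected, atol) {
  return actual <= expected + atol && actual >= expected - atol;
}
```

Note the ordered-bits mapping also makes `ulpDistance32(-0, +0)` come out as 0, which is usually what a conformance test wants.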
(UPDATE: More continued here: https://github.com/webmachinelearning/webnn/issues/338#issuecomment-1419652594)
Big thanks to @fdwr for your contribution. @BruceDai Please note that the proposed tolerances are all relative to an ideal baseline. In our WebML call earlier in the week, I believe we've agreed that the WPT test must be relative to a framework-agnostic reference implementation of WebNN.
I think we'll need a new repo under the webmachinelearning GitHub organization specifically to host the reference implementation for our WPT tests. @anssiko and @huningxin, do you have any objection to that? This is something we can help with too.
Thanks much @fdwr , that's a significant contribution!
@wchao1115 , I agree we should host the reference implementation that generates the ideal baseline results. I think that's the reason we created the webnn-baseline repo and implemented the first-wave ops. These ops are implemented in JavaScript with double-precision calculation and follow straightforward algorithms, such as conv2d.
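To illustrate the "straightforward algorithm in double precision" style (this is a hypothetical 1-D sketch for illustration, not the actual webnn-baseline conv2d code):

```javascript
// Naive valid-mode 1-D convolution (really cross-correlation, as ML
// frameworks define conv). JavaScript numbers are float64, so every
// multiply and add here happens in double precision, making the result
// a suitable "ideal" baseline to compare float32 outputs against.
function conv1d(input, filter) {
  const outputLength = input.length - filter.length + 1;
  const output = new Array(outputLength);
  for (let i = 0; i < outputLength; ++i) {
    let sum = 0;
    for (let j = 0; j < filter.length; ++j) {
      sum += input[i + j] * filter[j];
    }
    output[i] = sum;
  }
  return output;
}
```

The value of keeping the reference this literal is that correctness is auditable by eye; performance does not matter for generating expected test data.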
Thanks much @fdwr and @wchao1115 !
Some questions:

1. What's the algorithm for "`IEPOE`"? It would be very helpful for implementing "`IEPOE`" in JavaScript for the WPT tests if there's an algorithm for it.
2. What are the concrete `ATOL` values for float32 and float16? You mentioned `RTOL`, while `actual <= expected + atol && actual >= expected - atol` misses `RTOL`; should it be `actual <= rtol * expected + atol && actual >= rtol * expected - atol`? And if so, what's the concrete value for `RTOL`?
3. For the `exp` op, I had some observations on #288; it seems that a fixed ULP tolerance value doesn't apply to the `exp` op. @fdwr PTAL, thanks.
4. For ops having a fused activation option (`batchNormalization` / `conv2d` / `convTranspose2d`), what's the ULP tolerance if they use the fused activation option? For example, for `conv2d` fusing `sigmoid` activation, what's the ULP tolerance for the float32 case: should it still follow the `IEPOE*2 ULP` tolerance of `conv2d`, or the `3 ULP` of `sigmoid`?

@BruceDai It might be more time-efficient if we arranged a short 15-minute presentation at the next WG call to walk through and do Q&A on this topic. @anssiko what do you think?
@wchao1115 I'll put @fdwr on the agenda for our next 6 Oct call, working title "Recommended tolerances for WPT tests".
Thanks @fdwr !
I updated the previous PR https://github.com/web-platform-tests/wpt/pull/34287 following the precision-metrics suggestions above: I updated the existing data-movement op float32 tests which use the `ULP` metric, and added tanh op float32 tests which use the `ATOL` metric and gemm op float32 tests which use the `IEPOE` metric. This PR is under review.
The other float32 tests for the remaining first-wave ops in https://github.com/web-platform-tests/wpt/pull/36202 are being updated with new test data (float64 inputs + float32 baseline) and precision metrics.
Feng discussed with @fdwr moving the test data into separate JSON files, which would make the tests easier to maintain later. PR https://github.com/web-platform-tests/wpt/pull/36782 has now been submitted for review; the other tests will be added soon.
@BruceDai thanks for your continued work on WebNN WPT. Can you help answer the following questions:
I'm trying to identify opportunities to broaden our WPT contributor base. I'm aware of participants who are eager to get our remaining CR tasks completed and may be able to help in various capacities.
- What is the estimated test coverage (roughly) once we have addressed the open issues documented in https://github.com/webmachinelearning/webnn-baseline/issues
The open issues cover the rest of the ops that are unimplemented in WebNN-Baseline. Once they're fixed, we can leverage these pure JavaScript implementations to generate baseline test data for contributing op tests to wpt.
- Any specific open issues you'd like to bring to the next WG meeting for discussion?
Currently I have no open issues about tests; I'm still focusing on refining and adding first-wave operation tests to wpt.
- Any open PRs https://github.com/webmachinelearning/webnn-baseline/pulls that'd require special attention from the WG participants other than @huningxin @fdwr who are already looped in.
Since I've been refining the test JSON files of the wpt WebNN test PRs according to feedback, the open PRs of WebNN-Baseline are also being updated; once finished, I'll ask @huningxin and @fdwr to help review. Experts and engineers are welcome to join in on implementation and review.
I'm trying to identify opportunities to broaden our WPT contributor base. I'm aware of participants who are eager to get our remaining CR tasks completed and may be able to help in various capacities.
Thanks @anssiko. Looking forward to more contributors; I hope we can finish the CR tasks ASAP :)
@anssiko I updated first top comment, please take a look, thanks.
BTW, may we close this issue and track on #338? Thanks.
@BruceDai @fdwr, and others: with your continued contributions we are able to not just meet but exceed the test coverage expectations for the Candidate Recommendation maturity level. Thanks for your contributions and congratulations on reaching this major wpt milestone! This is pioneering work for wpt due to the domain-specific requirements of this API.
I'll close this tracker now, and we'll continue tracking the remaining work in #338, focusing on the two remaining ops.
Thanks to @fdwr's great reviewing efforts and @Honry's approvals and help, our WPT WebNN test PRs have all landed, after the previous blocker of syncing the updated WebNN IDL interfaces in https://github.com/web-platform-tests/wpt/pull/36908 was resolved by my PR fixing the CI failure.
There are now 432 WPT WebNN operation tests covering 40 ops in total for the first-wave models, after the convTranspose2d tests https://github.com/web-platform-tests/wpt/pull/38100 landed. We can run these tests on https://wpt.live/webnn/, e.g.:
Bruce is continuing to add tests for the remaining ops (#338), working closely with @fdwr.
WPT WebNN Tests:
1. WebNN API IDL Tests:
2. WebNN API JavaScript Tests (testharness.js) for operations tests: