Thanks @anssiko !
Regarding the discussion of execution modes:
@dsmilkov mentioned in the 14 Feb 2019 call:
I like the idea that you could execute a single operation, or a chain of operations
@RafaelCintron wrote in https://github.com/webmachinelearning/webnn/issues/3#issuecomment-457886464:
- The operator-style API should allow web developers to connect graph nodes together to allow the browser to optimize the graph.
It looks like there are three options:
As input, here is a summary of the platform APIs and their execution modes that we prototyped in our WebNN POC.
API | executing object | target device |
---|---|---|
NNAPI | graph | hardware agnostic (CPU, GPU, accelerator) |
MPS | chain of ops | GPU |
MPSNNGraph | graph | GPU |
BNNS | op | CPU |
MKL-DNN | chain of ops | CPU |
clDNN | graph | GPU |
Thanks @anssiko for following up on this! Apologies for the delay.
I want to emphasize that this proposal doesn't need us to define a graph format and/or serialization protocol for models. It doesn't preclude it either. In this proposal, the "model" is represented as a series of JS API calls to WebNN API. Shipping a model means shipping that JS code.
I would like to propose that for the OS target APIs that need a graph (e.g. NNAPI) that we use an implicit graph with a JIT in Chrome instead of using an explicit graph API.
This is now the trend in machine learning systems (see: JAX) and will allow us to keep the API surface small, exposing only an operations API. This small API surface will allow each of the different clients of WebNN to be satisfied and keep the API eager (again, the current trend in ML).
@nsthorat Can you please clarify the difference between "graph with a JIT in Chrome" vs. "explicit graph API"?
One of the goals of the API is to leverage specialized ML hardware that is not directly available to the Javascript JIT. So, at the end of the day, the API will need to have some "explicit calls". No?
At a minimum, an operators API will need to have the concept of a typed buffer that is filled from Javascript, as well as the concept of an operator that inputs and outputs one or more buffers. Operators can be chained together by having the output buffer of one operator serve as the input buffer of another operator. With this approach, we can keep as much data on the GPU (or specialized ML hardware) as possible without incurring costly roundtrips to the CPU for decision making.
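To make the buffer-and-operator idea concrete, here is a minimal sketch in plain JavaScript. All names (`DeviceBuffer`, `Operator`, `upload`, `readback`) are hypothetical and only illustrate the shape of such an API; a `Float32Array` stands in for device memory.

```javascript
// Hypothetical names throughout: DeviceBuffer, Operator, upload and readback
// are illustrative only and not part of any proposed WebNN surface.
class DeviceBuffer {
  constructor(length) {
    // A Float32Array stands in for memory owned by the GPU or accelerator.
    this.storage = new Float32Array(length);
  }
  upload(values) {
    this.storage.set(values); // CPU -> device copy
    return this;
  }
  readback() {
    return Array.from(this.storage); // device -> CPU copy
  }
}

class Operator {
  constructor(kernel) {
    this.kernel = kernel;
  }
  // Reads one buffer and writes a new one; nothing is copied back to the CPU.
  run(input) {
    const output = new DeviceBuffer(input.storage.length);
    output.storage.set(input.storage.map(this.kernel));
    return output;
  }
}

const square = new Operator((v) => v * v);
const neg = new Operator((v) => -v);

const input = new DeviceBuffer(4).upload([1, 2, 3, 4]);
// The output buffer of `square` serves as the input buffer of `neg`.
const chained = neg.run(square.run(input));
const chainedValues = chained.readback(); // the only round trip to the CPU
console.log(chainedValues);
```

In this sketch the intermediate result of `square` is never read back; only the final `readback()` crosses to the CPU.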
You are right, there is still an explicit API, but it is much more minimal than exposing a full graph API. Specifically it will allow you to call "jit" (obviously name tbd) around a function to convert it to a subgraph. The API surface here would be much smaller than a full graph API, like in TensorFlow < 2.0.
An explicit graph API would look like one of these (imagine your application calls an inference to an ML model multiple times):
TF Explicit Graph API
// Symbolic tensors to construct a graph
const xTensor = placeholder([2, 2]);
const yTensor = neg(square(xTensor));
while (true) {
// Execute the graph with concrete values.
const x = tensor([1, 2, 3, 4]);
const y = sess.run(yTensor, {xTensor: x});
console.log(y); // -1, -4, -9, -16
}
In the JIT'd version, the user code would look eager, which is what we do in TensorFlow.js (and this is where all of the major Python machine learning libraries are moving):
// Here is the explicit call to "JIT", just like in JAX, where this function will get optimized and converted to a subgraph.
const f = jit(x => {
const y = neg(square(x));
return y;
});
while (true) {
const x = tensor([1, 2, 3, 4]);
const y = f(x);
console.log(y); // -1, -4, -9, -16
}
The key piece here is that the API is still operation based and eager. Only when you are ready to optimize your subgraphs do you start calling off to "jit". This is also very similar to AutoGraph in TensorFlow where python functions get decorated and they are subsequently optimized.
Each of these approaches would allow keeping memory on the GPU or in the device, as you say. The tensor objects in this API would just be handles to the underlying computation. When you ask for the underlying values on the CPU, we synchronize the CPU / GPU (this is exactly what we do in TensorFlow.js -- more details in the TF.js paper in the sections about design and implementation).
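A rough sketch of the "tensor as a handle" idea, assuming a hypothetical `LazyTensor` class (not TensorFlow.js's actual implementation): operations only create new handles, and the simulated CPU/GPU synchronization happens when `data()` is called.

```javascript
// Hypothetical sketch: LazyTensor and its fields are illustrative names.
let syncCount = 0; // counts simulated CPU/GPU synchronizations

class LazyTensor {
  constructor(compute) {
    this.compute = compute; // deferred work standing in for queued device commands
    this.cached = null;
  }
  static from(values) {
    return new LazyTensor(() => Float32Array.from(values));
  }
  square() {
    return new LazyTensor(() => this.resolve().map((v) => v * v));
  }
  neg() {
    return new LazyTensor(() => this.resolve().map((v) => -v));
  }
  // Internal: evaluates the pending work without counting as a sync.
  resolve() {
    if (this.cached === null) this.cached = this.compute();
    return this.cached;
  }
  // The only method that hands values back to the CPU.
  data() {
    syncCount += 1;
    return Array.from(this.resolve());
  }
}

const y = LazyTensor.from([1, 2, 3, 4]).square().neg();
const syncsBeforeRead = syncCount; // only handles have been created so far
const values = y.data();
console.log(values, syncsBeforeRead, syncCount);
```

Chaining `square()` and `neg()` does not touch `syncCount`; only the explicit `data()` call does.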
Here is an example of a "JIT" style GPU compute API for WebGL that may serve as an example for how this kind of thing can work in a browser: http://gpu.rocks/
It has a "gpu.createKernel()" function that takes a JavaScript function and generates WebGL shaders out of it.
@nsthorat I am not sure about the claim that the "JIT" api is more minimal. There is still a need to standardize the set of ops (like "neg" and "square") that can be used inside the function that is to be "jitted". There is also a need to standardize/specify the set of language constructs that will be supported inside the function to be "jitted": e.g., will control-flow be permitted? To rephrase this, we need to standardize the set of ops (including control-flow ops, if desired) and their semantics, regardless of whether there is a jit or graph.
To clarify my previous comment: the JIT approach can make things easier for users, but it looks to me like a layer on top of a lower level layer/IR (operations and graph).
An eager API will be a smaller API surface, and this JIT proposal is a way to convert the eager code to a graph without adding a parallel set of APIs.
I'm going to add a couple examples that use control flow to show why the eager based approach is simpler and more flexible. Control flow ops (if, while, etc) are very common in RNNs / LSTMs.
// x is an instance of Tensor
const x = webnn.tensor([1, 2, 3]);
// jit() is optional here. It's simply a performance optimization. You could remove
// entirely the jit() function and this would all just work.
const y = jit(() => {
  // mean is an instance of Tensor
  const mean = x.mean();
  // Control flow: we condition based on a value of mean being less than 3.
  if (mean.data() < 3) {
    // Tensors
    return x.square().relu();
  } else {
    // Tensors
    return x.sqrt().relu();
  }
})(); // jit() returns the decorated function; calling it yields the Tensor y
// prints the underlying values of y
console.log(y.data());
Data structures needed:
- Tensor: holds concrete values, shape, rank, dtype
Methods:
- Tensor.data(): returns a CPU TypedArray or a native nested JavaScript array
- jit(): takes a function f(...xs: Tensor[]) => Tensor[] and returns another decorated fgraph(...xs: Tensor[]) => Tensor[]
// xSymbolic is an instance of SymbolicTensor
const inputShape = [3];
const xSymbolic = webnn.inputNode(inputShape);
// mean is an instance of SymbolicTensor
const mean = xSymbolic.mean();
// Control flow: we condition based on a value of mean being less than 3. We have to use
// the webnn.if statements because the values are not provided yet. ySymbolic is
// an instance of SymbolicTensor
const ySymbolic = webnn.if(
  webnn.less(mean, 3),
  // If less
  xSymbolic.square().relu(),
  // Otherwise
  xSymbolic.sqrt().relu()
);
// Now actually run the graph with concrete values. x is an instance of ConcreteTensor
const x = webnn.tensor([1, 2, 3]);
// Execute the graph, providing a concrete value for x. y is a ConcreteTensor
const y = webnn.run(ySymbolic, [[xSymbolic, x]]);
// prints the underlying values of y
console.log(y.data());
Data structures needed:
- ConcreteTensor: holds concrete values, shape, rank, dtype
- SymbolicTensor: no concrete values, just shape, rank, dtype. This is effectively a pointer to a node in the graph.
Methods:
- ConcreteTensor.data(): returns a CPU TypedArray or a native nested JavaScript array
- webnn.inputNode(): a function to define a graph input, returns a SymbolicTensor
- Operations that take and return SymbolicTensors. Arguably these should also take ConcreteTensors and return ConcreteTensors if we also want an eager operations API.
- nn.if(), nn.while(), etc.
- webnn.run(): a method that executes a graph by feeding ConcreteTensors in place of SymbolicTensors
Notice that in this code example, the webnn.if statement is much more complex than in the first example. You have to think in graph-based conditionals using explicit APIs for control flow ops. This is exactly why TensorFlow, and the other major libraries, are moving away from Graph-based APIs, and moving towards eager only APIs. The model running as a Graph is going to be an internal implementation detail for performance.
I will add a couple other points that we've noticed in developing TensorFlow.js and in working with TensorFlow:
Since web standards take quite a while to get implemented, I feel it would be a shame to adopt the programming model that is being largely abandoned. It is true most of the complexity can be hidden by libraries, but control flow fundamentally cannot. We cannot in user-space convert native if statements to the underlying webnn.if statements -- this is something only the JavaScript interpreter could do.
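To illustrate the control-flow point, here is a small sketch of what happens if a user-space library tries to recover a graph by tracing, that is, by running the function once with symbolic inputs and recording the ops. The `trace` helper and `Sym` class are hypothetical, and tracing is only one way a library might attempt this. The native `if` is evaluated by JavaScript itself, so the recorded graph contains just the branch taken at trace time and no conditional node.

```javascript
// Hypothetical sketch: `trace`, `Sym` and the op names are illustrative only.
class Sym {
  constructor(graph, name) {
    this.graph = graph;
    this.name = name;
  }
  apply(op) {
    const out = new Sym(this.graph, `t${this.graph.length}`);
    this.graph.push({ op, input: this.name, output: out.name });
    return out;
  }
  square() { return this.apply("square"); }
  sqrt() { return this.apply("sqrt"); }
  relu() { return this.apply("relu"); }
}

// Runs `fn` once with a symbolic input and returns the names of the recorded ops.
function trace(fn, ...extraArgs) {
  const graph = [];
  fn(new Sym(graph, "x"), ...extraArgs);
  return graph.map((node) => node.op);
}

// The branch depends on a plain JavaScript value, so the tracer never sees it.
function model(x, meanValue) {
  if (meanValue < 3) {
    return x.square().relu();
  }
  return x.sqrt().relu();
}

const tracedLow = trace(model, 2);  // mean < 3 at trace time
const tracedHigh = trace(model, 5); // mean >= 3 at trace time
console.log(tracedLow);  // [ 'square', 'relu' ]
console.log(tracedHigh); // [ 'sqrt', 'relu' ]
```

The two traces differ, and neither contains an "if" node, so a graph built this way is only valid for inputs that take the same branch.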
@nsthorat wrote:
We cannot in user-space convert native if statements to the underlying webnn.if statements -- this is something only the JavaScript interpreter could do.
One possible solution would be for jit() in user-space to parse the source of the JS function and convert the if statements to the underlying webnn.if statements. The same could apply to other control flow statements, like loops.
As @jdarpinian mentioned, gpu.js uses this solution:
It has a "gpu.createKernel()" function that takes a JavaScript function and generates WebGL shaders out of it.
There are some examples of gpu.js for control flow:
Let me split the various options that have been proposed/discussed as below:
(A) The operation API for executing single operations. This would serve a role similar to numpy in the case of python.
(B) The graph-builder and graph-executor API. The primary motivation for this over (A) is to enable optimizations that span multiple operations.
(C) The JIT api. Like (B), this tries to enable optimizations that span multiple operations while simplifying the developer’s job.
(D) The load-model API. The primary motivation here is to enable inference using a serialized representation of a model/graph. This has some advantages over (B) and (C) if the starting point is a pre-existing trained model created by other training frameworks.
Option C (jit) is an extension of option A. Option C essentially leaves the optimization job (across multiple ops) to the Javascript engine, with the “jit” construct serving only to give a hint to the optimizer where it should focus its optimization efforts. In an ideal world, we would not even need the "jit" construct (but I can see its value as a hint to the Javascript engine).
For a developer writing a model in Javascript, I agree that (C) is more convenient than (B). However, I just want to make sure that the debate between (B) and (C) doesn't rule out (D). My earlier impression was that we were debating (B) or (C) in addition to (D), not as a substitute for (D). But I just realized that others might think differently, and so want to bring up this point explicitly.
When I look at the goals listed at https://webmachinelearning.github.io/webnn , the vast majority of them (all of the application use cases) would likely be realized by the application running pre-built models (for face recognition, etc.). And (D) seems to be the simplest solution that would address most of these needs.
Regardless of whether we do (B) or (C) or both, they would need (A), that is, a standardization of the set of ops and their specification/semantics. So, that would be a logical starting point for all of (A), (B), and (C).
@gramalingam wrote:
(B) The graph-builder and graph-executor API. The primary motivation for this over (A) is to enable optimizations that span multiple operations.
I'd like to highlight that graph execution would be critical for good performance on some hardware accelerators.
For example, in our experiment with int8 quantized MobileNet V2 on a Pixel 3 smartphone, graph execution (~11ms) is about 130X faster than eager execution (~1496ms).
For graph execution testing, our WebNN POC (proof-of-concept) builds and executes one NNAPI model/graph for all ops. Its performance is close to native (according to ai-benchmark, the native execution time of MobileNet V2 (test 1c.) on Pixel 3 is 11ms).
For eager execution testing, as NNAPI is graph based, our WebNN POC naively builds one model/graph for each op and executes these models/graphs one by one. Its performance is much lower than graph execution.
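The difference between the two test setups can be sketched as follows. `GraphBackend` is a hypothetical stand-in for a graph-only platform API such as NNAPI, not real bindings, and the sketch only counts compile and execute calls; it does not model the measured timings.

```javascript
// Hypothetical sketch: GraphBackend stands in for a graph-only platform API
// such as NNAPI; its methods are illustrative, not real bindings.
class GraphBackend {
  constructor() {
    this.compileCount = 0;
    this.executeCount = 0;
  }
  compile(ops) {
    this.compileCount += 1;
    return { ops };
  }
  execute(compiled, input) {
    this.executeCount += 1;
    return compiled.ops.reduce((value, op) => op(value), input);
  }
}

const ops = [(v) => v * v, (v) => -v, (v) => Math.max(v, 0)];

// Graph mode: one model for all ops, compiled once and executed once.
const graphBackend = new GraphBackend();
const graphResult = graphBackend.execute(graphBackend.compile(ops), 3);

// Naive eager mode: one single-op model per op, executed one by one.
const eagerBackend = new GraphBackend();
let eagerResult = 3;
for (const op of ops) {
  eagerResult = eagerBackend.execute(eagerBackend.compile([op]), eagerResult);
}

console.log(graphBackend.compileCount, eagerBackend.compileCount);
```

Both modes compute the same result, but the eager emulation pays one compile and one execute per op, which is where the per-op overhead on a graph-only backend comes from.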
@nsthorat 's jit proposal sounds good for training scenarios but has drawbacks.
Multiple people in the group favor a graph or operators based API with the understanding that Javascript frameworks will take care of implementing the 'loadModel' API. This is crucial to the success of the API as the 'loadModel' use case is favored by the majority of partners we've spoken to.
With the 'jit' based approach, the theoretical 'loadModel' framework would need to parse the model's proprietary format, turn it into a series of strings concatenated together to form valid Javascript, evaluate the Javascript using the Javascript engine, have the Javascript engine turn the model into a graph, compile the graph, and run the graph. That's a lot of data transformations from one form to another. With a graph API, the Javascript framework can go straight from model parsing to in-memory graph, thus skipping Javascript parsing. As more web content moves to WASM over time, I expect people will want to keep things 'native' as much as possible for best performance.
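As a sketch of the "straight from model parsing to in-memory graph" path, assume a made-up JSON model layout and a hypothetical `GraphBuilder`; neither reflects a real model format or the eventual API.

```javascript
// Hypothetical sketch: the JSON layout and GraphBuilder are illustrative.
class GraphBuilder {
  constructor() {
    this.nodes = [];
  }
  input(name) {
    this.nodes.push({ op: "input", name, inputs: [] });
    return name;
  }
  op(type, name, inputs) {
    this.nodes.push({ op: type, name, inputs });
    return name;
  }
}

// Walks the parsed model and calls the builder directly; no JavaScript source
// is generated or evaluated along the way.
function loadModel(serialized) {
  const model = JSON.parse(serialized);
  const builder = new GraphBuilder();
  for (const name of model.inputs) builder.input(name);
  for (const layer of model.layers) builder.op(layer.type, layer.name, layer.inputs);
  return builder.nodes;
}

const serialized = JSON.stringify({
  inputs: ["x"],
  layers: [
    { type: "square", name: "a", inputs: ["x"] },
    { type: "relu", name: "y", inputs: ["a"] },
  ],
});
const graph = loadModel(serialized);
console.log(graph.map((n) => n.op));
```

The framework-level `loadModel` here is ordinary library code on top of the builder, which matches the expectation that model formats are handled above the browser API.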
Since the browser is better able to perform platform optimizations when given a graph instead of being given operators, I tend to think that (B) from @gramalingam 's list is where we should start: a graph-builder and graph-executor API. A 'jit' can be added later if/when we tackle training scenarios.
I would strongly advise against targeting any serialization APIs; that world is constantly in flux. Best to leave that to application-level logic until the dust settles. This implies there will necessarily need to be an application-level translation layer between any model format and the browser's native APIs (thus making the internal details of the library unimportant).
The thing you get with a JIT-based approach is a smaller API surface. A graph is a great IR, but an unnatural way of programming. This is why most major libraries are moving away from this. If we decide to expose a graph-based API to the browser, you will have to support this forever (despite the fact that the trend is moving away from explicit graph-based APIs).
I think maybe we have to figure out what the intention of this proposal is. If you expect there to always be a library on top of browser-based APIs, then having a graph-based API is reasonable (because the application-logic will convert user-facing logic to the internal representation of a Graph). I would love for the long term goals of this project to be to allow direct usage of native calls in the browser. To me that completely rules out a graph-based API (again, graph-based IR makes sense).
Just want to point out that AutoGraph (a JIT in TensorFlow) is near-graph performance: Ref: Page 9 Table 2: https://arxiv.org/pdf/1810.08061.pdf
@nsthorat : I completely agree that programming by building a graph is unnatural. Programming using the host language's constructs will be simpler and more natural. However, I still have some questions about what you are proposing.
First, I would like to distinguish between the following two styles:
(1) Vanilla operation API: This is an API for executing operations directly. Thus, executing "X = Vanilla.multiply(Y, Z)" requires Y and Z to represent concrete tensors (and nothing else).
(2) Eager operation API: This is an API that combines a direct execution of operations with a simultaneous construction of an underlying graph. Here, executing "X = Eager.multiply(Y,Z)" requires Y and Z to have associated concrete tensor values as well as a symbolic tensor, associated with a sub-graph that indicates how this value is computed. In short, this API is a composition that combines a vanilla operation API and a graph-builder API (performing both).
I am unclear which of the above two you are referring to in various places.
Just to complete the above picture, the graph-builder API would allow us to do "X = graph.multiply(Y,Z)" where Y and Z are symbolic tensors (within a graph), and X represents a new symbolic tensor created within the graph.
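The three styles can be put side by side in a small sketch. The object names `Vanilla`, `Eager` and `Graph` mirror the ones used above, but the implementations are hypothetical and use plain arrays and object literals in place of real tensors and graph nodes.

```javascript
// Hypothetical sketch: Vanilla, Eager and Graph mirror the names used in the
// comment above; none of this is a proposed API.
const multiplyValues = (a, b) => a.map((v, i) => v * b[i]);

// (1) Vanilla: concrete tensors in, concrete tensor out, nothing recorded.
const Vanilla = {
  multiply: (y, z) => multiplyValues(y, z),
};

// Graph builder: symbolic tensors in, new symbolic tensor out, nothing computed.
const Graph = {
  input: (name) => ({ op: "input", name, inputs: [] }),
  multiply: (y, z) => ({ op: "multiply", inputs: [y, z] }),
};

// (2) Eager: does both at once, so each tensor carries a value and a node.
const Eager = {
  tensor: (name, value) => ({ value, node: Graph.input(name) }),
  multiply: (y, z) => ({
    value: Vanilla.multiply(y.value, z.value),
    node: Graph.multiply(y.node, z.node),
  }),
};

const vanillaX = Vanilla.multiply([1, 2], [3, 4]);
const graphX = Graph.multiply(Graph.input("Y"), Graph.input("Z"));
const eagerX = Eager.multiply(Eager.tensor("Y", [1, 2]), Eager.tensor("Z", [3, 4]));
console.log(vanillaX, graphX.op, eagerX.value, eagerX.node.op);
```

All three expose `multiply` with the same two-argument signature; they differ only in what the arguments and results represent.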
Now, as far as non-control-flow ops are concerned, the Vanilla API, Eager API, as well as the Graph API will need to expose the same set of ops (like "Multiply") with the same signature (though they do different things). Do you agree that we need to define this set of operations (Multiply, Relu, Tanh, etc.) and their specification? Or, do you consider this as "being in flux" and out-of-scope for what we are doing here?
My primary concern with the JIT approach is the following: the JIT compiler can effectively build the graph and execute it if the developer writes code in a certain stylistic form. However, once the developer starts using the abstraction facilities available in Javascript, the JIT compiler will start running into undecidable static analysis problems in constructing the graph, and so it can only do a best-effort graph construction.
For example, if the user uses Javascript functions or objects with methods, what guarantees will the JIT compiler be able to provide in terms of how the functions/methods are handled during the graph construction? This could be particularly difficult since Javascript is not statically typed. So, we don’t even have a static type system to guide the JIT compiler.
The plus-side of the JIT approach is that it allows the users to use the Javascript syntax and features they are already familiar with. The down-side is that not all of the language features may be handled equally well by the JIT compiler. Thus, there is a potential lack of transparency, from a performance perspective, for the user. Should or will the users be concerned about this? I don't know. However, there is value in having the graph-builder API, which is less convenient from a syntax perspective, but gives the user the power to specify the graph that is constructed and passed on to the execution backend.
I am not convinced that the "surface area" of the graph-builder API is much more than that of a "vanilla" API or an "eager" API. It is exactly equivalent to those APIs for the non-control-flow ops. The primary concept that needs to be pinned down for control-flow ops is the notion of a "function" or "closure" or "subgraph" that can be used to specify, for example, the then-branch or else-branch of an if-then-else. This is an important semantic concept. So, why try to avoid standardizing this concept?
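One way to picture the "subgraph" concept for control flow is a conditional node whose branches are closures that each build their own subgraph. This is only a sketch under assumed names (`cond`, `buildSubgraph`), not a proposal for the actual signature.

```javascript
// Hypothetical sketch: `cond` and the node layout are illustrative only.
function buildSubgraph(body) {
  const nodes = [];
  const builder = {
    op(type, input) {
      const name = `n${nodes.length}`;
      nodes.push({ op: type, name, input });
      return name;
    },
  };
  const output = body(builder);
  return { nodes, output };
}

// Both branches are captured as subgraphs; neither is executed here.
function cond(predicate, thenBody, elseBody) {
  return {
    op: "if",
    predicate,
    thenBranch: buildSubgraph(thenBody),
    elseBranch: buildSubgraph(elseBody),
  };
}

const ifNode = cond(
  "mean_less_than_3",
  (b) => b.op("relu", b.op("square", "x")),
  (b) => b.op("relu", b.op("sqrt", "x"))
);
console.log(ifNode.thenBranch.nodes.map((n) => n.op)); // [ 'square', 'relu' ]
console.log(ifNode.elseBranch.nodes.map((n) => n.op)); // [ 'sqrt', 'relu' ]
```

Unlike a native `if`, both branches are recorded, so the backend receives the full conditional and can pick a branch at execution time.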
I think it is possible to accommodate these diverse goals with a lower-level graph API and with the jit targeting this lower-level graph IR.
WebML CG Teleconference – 11 April 2019 resolutions:
RESOLVED: start "v0 spec" as a graph builder and graph-executor API and in "v1" explore direct usage of native calls in the browser as outlined in the JIT-based approach for improved web developer ergonomics
RESOLVED: Evolve WebNN API specification using https://github.com/intel/webml-polyfill/blob/master/docs/api.md as foundation specification
This issue has two resolutions - is there any need to keep it open?
This issue has some great discussion on eager execution that might inform future design considerations. I'd label this as enhancement.
Per https://www.w3.org/2024/09/23-webmachinelearning-minutes.html#t05 - consensus to close this issue, since we've decided to pursue a graph API.
Per resolution on the 14 Feb 2019 call, this issue is for discussing requirements for an API for executing operations.
IIRC @dsmilkov volunteered to take the first stab at this issue (thanks!). To frame the discussion, perhaps a good start is to evaluate the requirements through the lens of existing ML frameworks as API consumers. I believe also @huningxin's proof-of-concept might provide useful input.