Closed pyu10055 closed 4 years ago
> this is not very convenient since this is an extra step for the user and the user might not know the shape of the output beforehand.

Good catch! One possible solution is allowing user code to query the output shape, as Android NNAPI allows. @miaowang14
> 2. The current execution is built on the compilation of the full graph, while the execution API does not prevent users from executing a sub-graph of the model.

I don't quite get this issue. Could you please elaborate on the case where the API compiles the full graph while executing a sub-graph?
But the output shape may not always be statically known. It would be good if the API allows for the possibility of the output buffer being allocated during execution when its size is known. Can we treat the call to "setOutput" as optional (not required)?
> But the output shape may not always be statically known. It would be good if the API allows for the possibility of the output buffer being allocated during execution when its size is known.

+1. The API may also need to allow specifying input shapes, which is not supported currently.
> Can we treat the call to "setOutput" as optional (not required)?

Then how would we get the execution results? Could you please elaborate on the idea?
The execution API essentially needs to allow a way to specify the inputs, optionally specify output buffers, and get back the outputs. I think various possibilities exist for this. In the long run, it may actually be useful to support multiple signatures to support different ways of doing this (mostly as a syntactic convenience).
One option would be to have a return result (for example, a dictionary mapping output-name to its value).
> One option would be to have a return result (for example, a dictionary mapping output-name to its value).

+1.
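As a sketch of that option (a plain JS mock, not the spec API; the element-wise add "graph" and the input/output names are invented purely for illustration), `compute` could resolve to a dictionary keyed by output name:

```javascript
// Hypothetical mock (not WebNN): compute() resolves to a dictionary
// mapping each output name to its computed value.
async function compute(inputs) {
  // Stand-in for a compiled graph: output = input1 + input2, element-wise.
  const output = inputs.input1.map((v, i) => v + inputs.input2[i]);
  return { output };
}

compute({
  input1: new Float32Array([1, 2]),
  input2: new Float32Array([3, 4]),
}).then((results) => {
  console.log(results.output); // Float32Array [4, 6]
});
```

The caller then picks outputs by name (`results.output`) instead of pre-binding buffers.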
I'm also looking at a few related `Operand` type issues. Maybe we can make one sweeping change.

1. `Operand` type should be limited to tensor types, i.e. only the `tensor-float32` type is allowed and not `float32`. Non-tensor arrays should use sequences, lists, or TypedArray for clarity, since a tensor can also be a GPU resource.
2. `Operand` should support sync and async download of tensor data to a TypedArray. This should address the first issue above.

> The execution API essentially needs to allow a way to specify the inputs, optionally specify output buffers, and get back the outputs.

+1.
According to the current Execution API, the missing functionalities are specifying the input shape and getting the output shape and data. So semantically, we may extend the `setInput` method with an optional `shape` parameter and add `getOutputShape` and `readOutput` methods, such as:
```diff
 interface Execution {
-  void setInput(DOMString name, ArrayBufferView data);
+  void setInput(DOMString name, ArrayBufferView data, optional sequence<long> shape);
   void setOutput(DOMString name, ArrayBufferView data);
   Promise<void> startCompute();
+  sequence<long> getOutputShape(DOMString name);
+  Promise<void> readOutput(DOMString name, ArrayBufferView buffer);
 };
```
With that, the sample code for dynamic input/output shapes would be:
```js
// Create an Execution object for the compiled model.
const execution = compilation.createExecution();
const inputShape = [2, 2, 2, 2];
// Setup the input buffers with value 1.
const inputBuffer1 = new Float32Array(sizeOfShape(inputShape)).fill(1);
const inputBuffer2 = new Float32Array(sizeOfShape(inputShape)).fill(1);
// Specify the input data and its shape.
execution.setInput('input1', inputBuffer1, inputShape);
execution.setInput('input2', inputBuffer2, inputShape);
// Start the asynchronous computation.
await execution.startCompute();
// Get the output shape and allocate the buffer.
const outputShape = execution.getOutputShape('output');
const outputBuffer = new Float32Array(sizeOfShape(outputShape));
// Read the output values into the allocated buffer.
await execution.readOutput('output', outputBuffer);
// The computed result is now in outputBuffer.
console.log(outputBuffer);
```
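The sample assumes a `sizeOfShape` helper that turns a shape into an element count; the helper name comes from the sample above, while its body is my assumption about the obvious implementation:

```javascript
// Assumed helper: number of elements in a tensor of the given shape,
// i.e. the product of its dimensions (1 for an empty, rank-0 shape).
function sizeOfShape(shape) {
  return shape.reduce((size, dimension) => size * dimension, 1);
}

console.log(sizeOfShape([2, 2, 2, 2])); // 16
```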
> One option would be to have a return result (for example, a dictionary mapping output-name to its value).

This could be a syntactic improvement. One difference is that the above proposal allows getting the value of a single output instead of returning them all.
> I'm also looking at a few related `Operand` type issues.

@wchao1115 , I agree there is room to improve `Operand`. However, `Operand` is only used for graph building in the current API. Do you propose to use it also for graph execution?
@huningxin I understand that `Operand` as it is defined now acts more like a graph edge and not necessarily a data type like tensor. But in practice the two concepts are inseparable; it would be much simpler for a graph API to think of them as one. Both TensorFlow and PyTorch handle it this way.
Concretely, a few things stand out for me:

1. There should not be a `float32` data type as a separate notion from an operand of a `tensor-float32` data type. An operand with `float32` type should just mean a graph edge that would carry a tensor of type float32.
2. Quantized tensor as a type burdens the `Operand` type with more procedural data that doesn't really contribute to the identity of the operand type (as a graph edge). As previously pointed out, it also unnecessarily raises the requirement that all the operator API must support quantized types; a requirement that actually cannot be fulfilled in all cases.

A leaner design in my mind would be one with an operand exposing a method for the app to optionally download its content to an ArrayBuffer only if they want to. This call would fail if it's made before the graph is executed, since the output tensor won't be known then. With that API addition, it turns out that `execution.setOutput` becomes entirely unnecessary, since a `model` would be previously created with the named dictionary of the output operands (see `nn.createModel`) and an ArrayBuffer of an output operand is created by the execution and not the construction of the graph.
> 2. Quantized tensor as a type should be avoided as it forces all the defined operators to support quantized type, which isn't realistic (#84). It was previously agreed that quantized types should be factored out (#44).

In #17 it was "RESOLVED: The specification will reference a subset of the ONNX operations". The ONNX operations include type constraints, e.g. https://github.com/onnx/onnx/blob/master/docs/Operators.md#MatMulInteger . Can we just follow them?
So, requiring the app to always download the output tensor back to a CPU buffer view doesn't make sense, since it may want to pass the GPU resource along to another API that may accept it, e.g. with ML models that emit output images, etc.
That's a good point. Besides passing the GPU-backed output tensor to another API, I think the Execution API should be able to accept the output tensor as input. With that, it would enable pipelining multiple graph executions without moving data across devices.
> A leaner design in my mind would be one with an operand exposing a method for the app to optionally download its content to an ArrayBuffer only if they want to. This call would fail if it's made before the graph is executed since the output tensor won't be known then.
My understanding is that the input and output tensors of the Execution API should be device dependent, say a GPU resource. However, `Operand` is device independent; for example, `nn.input` is just a symbolic one that never has values, and `nn.constant` doesn't know on which device to store its values until compilation.

So I am thinking about adding a `Tensor` interface that is backed by device memory. The `Execution` would accept input `Tensor`s and output `Tensor`s. The `Tensor` interface would have methods for async shape querying and data downloading. As `Compilation` has a target device, we may allow creating a `Tensor` from a `Compilation`. A code sketch would be:
```js
// Create an Execution object for the compiled model.
const execution = compilation.createExecution();
const inputShape = [2, 2, 2, 2];
// Setup the input buffers with value 1.
const inputBuffer1 = new Float32Array(sizeOfShape(inputShape)).fill(1);
const inputBuffer2 = new Float32Array(sizeOfShape(inputShape)).fill(1);
// Create the input and output tensors for the compilation device.
const inputTensor1 = compilation.device.tensor(float32TensorType, inputBuffer1);
const inputTensor2 = compilation.device.tensor(float32TensorType, inputBuffer2);
const outputTensor = compilation.device.tensor();
// Set input and output tensors on the execution.
execution.setInput('input1', inputTensor1);
execution.setInput('input2', inputTensor2);
execution.setOutput('output', outputTensor);
// Start the computation; no longer async, as reading the tensor back is async.
execution.startCompute();
// Query the output shape; awaits completion of shape inference.
const outputShape = await outputTensor.getDimensions();
const outputBuffer = new Float32Array(sizeOfShape(outputShape));
// Read the data back; awaits completion of inference.
await outputTensor.readInto(outputBuffer);
console.log(outputBuffer);
```
The `execution.setOutput` may still be useful, as it would allow reusing a `Tensor` from one execution to another.
Hi @wchao1115 : one clarification about the attributes and operands: I think that the primary distinction between attributes and operands is that attributes represent values that are known statically, at the time the graph/model is built; on the other hand, operands represent values that typically will not be known statically (though, in some special cases, they might be).
So, one key question we should address is how we want to handle scalar values that are known only at runtime (and not statically). Many frameworks do use tensors of rank 0 to represent such scalars. We could do the same. In this case, restricting operand types to be tensor-float32 and not float32 would work. So, this is doable, though it may incur a small performance penalty. But since these scalars are not going to be the performance bottleneck, it should be okay.
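To make the rank-0 suggestion concrete, here is a small sketch; `sizeOfShape` is the dimension-product helper used elsewhere in this thread, and the "runtime threshold" scalar is an invented example:

```javascript
// A rank-0 tensor has an empty shape; the empty product is 1, so it
// holds exactly one element -- enough to carry a runtime scalar.
function sizeOfShape(shape) {
  return shape.reduce((size, dimension) => size * dimension, 1);
}

const scalarShape = [];                                // rank 0
const scalarBuffer = new Float32Array(sizeOfShape(scalarShape));
scalarBuffer[0] = 0.5;                                 // e.g. a runtime threshold
console.log(scalarShape.length, scalarBuffer.length);  // 0 1
```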
@gramalingam That's a good point. We haven't fully defined the notion of tensor scalar. From the graph execution standpoint, tensors are device-specific resources, either as inputs or outputs of the process.
@huningxin I think this is too complicated, since developers now need to deal with 3 different notions -- operand, tensor, and buffer. Besides, the tensors are now created by `compilation`, which raises more questions than it answers.
Even though tensors are device-dependent, it doesn't mean that we have to surface them as a type to the users. I think the current notion of `nn.input` is perfectly fine. If in the future we need an ability to construct an operand out of other device resources, e.g. a video frame, there is nothing stopping us from defining an `nn.inputFromVideoFrame` method. We don't need to expose the notion of a device-specific resource as a type to the users; we just need to give them a way to work with it. TF.js also does it this way. Their so-called `Tensor` type, in spite of its name, is still a conceptual type and not a device-dependent resource type.

Same for the output operand: as long as there is a way to retrieve the computed content in it and put it in a CPU buffer or in a future resource type, there is no need to expose the notion of the device-specific resource type to the users.
Also, if you look at how `nn.createModel` is defined, we already require through this API that the users specify the dictionary of all output operands, even before compilation. So all that we need is to expose a way to get the content out of it and we won't even need `setOutput` anymore. We also don't need to pass the output operands to `startCompute`, as they're already defined in the model.
> Also, if you look at how `nn.createModel` is defined, we already require through this API that the users specify the dictionary of all output operands, even before compilation. So all that we need is to expose a way to get the content out of it and we won't even need `setOutput` anymore. We also don't need to pass the output operands to `startCompute`, as it's already defined in the model.
Users may compile the same `Model` to multiple `Compilation`s with different `CompilationOptions` and target different underlying devices. In this case, there is only one output `Operand` associated with the `Model`. This `Operand` might not be able to represent the different device resources. For example:
```js
const model = await nn.createModel([{name: 'output', operand: output}]);
// e.g. target iGPU
const compilation1 = await model.createCompilation({powerPreference: 'low-power'});
// e.g. target dGPU
const compilation2 = await model.createCompilation({powerPreference: 'high-performance'});
// Create Execution objects for the compiled models.
const execution1 = await compilation1.createExecution();
const execution2 = await compilation2.createExecution();
execution1.startCompute();
execution2.startCompute();
// Which device to read back, iGPU or dGPU?
await output.readInto(buffer);
```
Hi @huningxin : to read the output value into a buffer, we would need to invoke a "getValue(operand)" method either on execution or on a result returned by execution.startCompute(), rather than on an operand.
@wchao1115 : I assume that the existing buffer abstractions suffice to represent memory on different devices? Assuming that, I think the primary distinction between a "Tensor" and a "buffer" would be that a "Tensor" attaches a "shape" to a buffer. I think that is a useful addition ... in most frameworks there is such a distinction. I assume the existing Buffer abstraction does not have a shape associated with it?
> to read the output value into a buffer, we would need to invoke a "getValue(operand)" method on either execution

This might not allow passing the GPU-backed output to another API or to a following execution without moving data between GPU and CPU.

> or on a result returned by execution.startCompute(),

This would allow the above use case. However, as an execution is for one-shot inference, this might not allow reusing the device memory across multiple executions/inferences.

> rather than an operand.

I agree we should not read the value from an operand.

> I assume that the existing buffer abstractions suffice to represent memory on different devices?

If you mean `ArrayBufferView`, I suppose it only represents CPU memory. WebGPU is defining GPUBuffer for GPU memory.

> I assume the existing Buffer abstraction does not have a shape associated with it?

Neither ArrayBufferView nor GPUBuffer has a shape.
Can this work? (I don't recall why we would need `execution` in addition to `compilation`; I also simplified the method names for readability.)
```js
// The model produces 2 outputs.
const model = await nn.createModel([{name: 'output1', operand: output1}, {name: 'output2', operand: output2}]);
const compilation = await model.compile();
const buffers = await compilation.compute(['output1']);
// 'buffers' is sequence<ArrayBufferView>.
// 'buffers[0]' holds a filled CPU buffer with just the content of output1.
```
> I don't recall why we would need `execution` in addition to `compilation`

The previous discussion is at https://github.com/webmachinelearning/webnn/issues/39#issuecomment-589963045. Probably we need to revisit that.
> also simplify the method names for readability

+1
> Can this work?

Two opens:

1. The inputs are not set for a compute. Is it missed?
2. `sequence<ArrayBufferView>` would download the data to CPU memory. It might not allow the use case @wchao1115 mentioned in https://github.com/webmachinelearning/webnn/issues/87#issuecomment-688623748:

> it may want to pass the GPU resource along to other API that may accept it e.g. with ML models that emit output images, etc.
I prefer a single "compilation.compute(...)" method (as suggested by Chai).

Re. Ningxin's point 1: yes, we need to supply the input values as parameters to the "compute" method.

Re. Ningxin's point 2: I think if we want to support this use case, we have two options. (a) The caller specifies (using parameters to the "compute" method) whether they want the output copied to the CPU or not. (b) We don't do any copying in "compute" and the caller copies the results to the CPU separately (if they want to).

I prefer option (b). So, basically, generalize Chai's suggestion to:

```js
const outputs = await compilation.compute(input_values, desired_output_names);
```

where `outputs` is a sequence of `output_type`, and `output_type` is a union that includes tensors on the GPU as well as the CPU, etc.
I think it is useful to have a type representing a tensor, regardless of where it is located.
> The inputs are not set for a compute. Is it missed?

I omitted the `nn.input` part in my sample code above for brevity. The input side is the same as before. If we want to accept an external resource for an input operand, we could add new methods, e.g. `nn.inputFromXXX`, in the future.
> we need to supply the input-values as parameters to the "compute" method.

Not sure why. I don't think we need to pass the input operands to `compute`, as they're already known and labeled in the graph.
> it may want to pass the GPU resource along to other API that may accept it e.g. with ML models that emit output images, etc.

I looked at both canvas-2d and canvas-WebGL to understand how to move the output tensor data to either endpoint, and it looks like both paths do start with ArrayBufferView, which unfortunately would require a copy. But in most known scenarios the bottleneck is getting the data into the model, so requiring a copy-out may not be too bad. Future resource sharing with WebGPU may be a possibility, but that would probably need to happen at the context level to ensure cross-adapter resource sharing. I think assuming selective CPU downloads on outputs is probably a safe bet.
> Not sure why. I don't think we need to pass the input operands to compute as it's already known and labeled in the graph.

We know what the inputs (that is, the input-variables) of the graph are (when we build the graph), but not the values of these inputs (which are determined only when we call "compute" and will vary from one call of "compute" to another).
@gramalingam the input values are known, right? That's why it's the input.
(copied from the spec sample)

```js
// Setup the input buffers with value 1.
const inputBuffer1 = new Float32Array(TENSOR_SIZE).fill(1);
const inputBuffer2 = new Float32Array(TENSOR_SIZE).fill(1);
// Associate the input buffers to model’s inputs.
execution.setInput('input1', inputBuffer1);
execution.setInput('input2', inputBuffer2);
```
Yes, we can specify the input values using `execution.setInput(...)`. But if we want to eliminate `execution` and directly call `compute` from a `compilation`, we need some mechanism to achieve the same goal as `execution.setInput`.
> Yes, we can specify the input values using `execution.setInput(...)`. But if we want to eliminate `execution` and directly call `compute` from a `compilation`, we need some mechanism to achieve the same goal as `execution.setInput`.
That would be `Compilation.setInput`. I just thought `Compilation` and `Execution` are redundant. It looks like we can make do with just one stateful object instead of two.
```webidl
interface Compilation {
  void setInput(DOMString name, ArrayBufferView data);
  Promise<sequence<ArrayBufferView>> compute(sequence<DOMString>);
};
```
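A toy mock of that single-object shape (not the spec; the element-wise add "graph" and the input names are invented purely to make the pattern runnable):

```javascript
// Mock of the proposed merged interface: one stateful Compilation with
// setInput() and compute(outputNames) returning a sequence of buffers.
class MockCompilation {
  constructor() { this.inputs = {}; }
  setInput(name, data) { this.inputs[name] = data; }
  async compute(outputNames) {
    // Stand-in graph: output = input1 + input2, element-wise.
    const outputs = {
      output: this.inputs.input1.map((v, i) => v + this.inputs.input2[i]),
    };
    return outputNames.map((name) => outputs[name]);
  }
}

const compilation = new MockCompilation();
compilation.setInput('input1', new Float32Array([1, 1]));
compilation.setInput('input2', new Float32Array([2, 3]));
compilation.compute(['output']).then((buffers) => {
  console.log(buffers[0]); // Float32Array [3, 4]
});
```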
> I prefer a single "compilation.compute(...)" method (as suggested by Chai).

+1.

> Re. Ningxin's point 1: yes: we need to supply the input-values as parameters to the "compute" method.

+1. This would make the `Compilation` stateless.

> (b) We don't do any copying in "compute" and the caller copies the results to the CPU separately (if they want to).

+1. This would enable passing the GPU resource along to another API even after the `compute` method.

> I think it is useful to have a type representing a tensor, regardless of where it is located.

+1. I propose to add a `Tensor` interface and have `Compilation.compute` operate on tensors instead of ArrayBufferView.
There are some usages with code sketches (updated).

Create input and output tensors for a compilation:

```js
// Create input tensors with fully specified shapes.
const tensor1 = compilation.tensor(inputShape1);
const tensor2 = compilation.tensor(inputShape2);
// Create an output tensor without specifying a shape.
const tensor3 = compilation.tensor();
```
Write data into the input tensors:

```js
const inputBuffer1 = new Float32Array(sizeOfShape(inputShape1)).fill(value);
tensor1.writeFrom(inputBuffer1);
// Similar code for tensor2.
```
Compute is not async, as reading the tensor back is async:

```js
compilation.compute({'input1': tensor1, 'input2': tensor2}, {'output': tensor3});
```
Asynchronously read the output back into a buffer:

```js
const outputBuffer = new Float32Array(sizeOfShape(outputShape));
await tensor3.readInto(outputBuffer);
```
If the output shape is dynamic, users could query the shape before allocating the buffer:

```js
// Might wait for shape inference to be done.
const outputShape = await outputTensor.getDimensions();
const outputBuffer = new Float32Array(sizeOfShape(outputShape));
```
A following compute could reuse the tensors without creating new ones:

```js
// The second inference.
tensor1.writeFrom(inputBuffer1);
tensor2.writeFrom(inputBuffer2);
compilation.compute({'input1': tensor1, 'input2': tensor2}, {'output': tensor3});
await tensor3.readInto(outputBuffer);
```
An output tensor could be passed to another compilation's (different graph) compute without downloading its data:

```js
tensor1.writeFrom(inputBuffer1);
tensor2.writeFrom(inputBuffer2);
compilation.compute({'input1': tensor1, 'input2': tensor2}, {'output': tensor3});
compilation2.compute({'input': tensor3}, {'output': tensor4});
// Only download the final result.
await tensor4.readInto(outputBuffer);
```
@huningxin It seems a key difference in this discussion so far is whether we should define a notion of Tensor in the API as a dedicated resource that can be implemented and passed in and out of the API, potentially enabling resource sharing with other APIs.
This idea is conceptually sound, with slightly greater complexity, but my concern is on the implementation side, especially for a GPU implementation of the API. The first problem is that you actually don't want to map a Tensor to a GPU resource like a UAV buffer one to one. In practice it's far more common to pool and reuse a single UAV buffer for multiple sets of tensor data with proper alignments. Frequent creation of GPU resources is prohibitively expensive.
Secondly, activities such as upload/download of tensor data from/to CPU buffers are commonly pipelined in the same command list along with shader dispatches that carry out the computation of graph operations, so that the entire command list can be pipelined and executed in one go. It would be highly inefficient to break up the command list just to upload the tensor data for instance.
And lastly, there are some types of tensor data, e.g. constant weight data, that sometimes require special treatment depending on the underlying hardware; a process that involves an additional copy pass before issuing shader dispatches. It's hard to properly locate the timing of that copy pass when the tensor data is associated with Tensor API objects that can operate independently from the rest of the graph execution process.

Based on this set of difficulties, I'm still advocating against defining a Tensor type at the API level and instead continuing with the current approach of explicit upload/download methods off the `Compilation` interface.
@wchao1115 , thanks for sharing your thoughts on the GPU implementation; that is extremely helpful.
> The first problem is that you actually don't want to map a Tensor to a GPU resource like a UAV buffer, one to one.

+1. I think we might not need to maintain a one-to-one mapping between a Tensor and a GPU resource. A tensor could just "record" the logical resource usage, such as being an input or output of a compute. The real resource allocation is up to the implementation.
> Secondly, activities such as upload/download of tensor data from/to CPU buffers are commonly pipelined in the same command list along with shader dispatches that carry out the computation of graph operations

That's a good point.
To test my understanding, I would slightly change the sketch code of the inference loop to:

```js
tensor1.writeFrom(inputBuffer1);
tensor2.writeFrom(inputBuffer2);
compilation.compute({'input1': tensor1, 'input2': tensor2}, {'output': tensor3});
await tensor3.readInto(outputBuffer);
```
When running the above code, a GPU implementation might be able to delay the execution until the last step (`await tensor3.readInto(outputBuffer)`). When the user wants to read back the data, the GPU implementation could pipeline the CPU memory upload for `tensor1` and `tensor2`, the shader dispatches of the compiled operations of `compilation`, and the CPU memory download from `tensor3` into one command list, and execute it in one go.
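To illustrate the delayed-execution idea (a pure-JS mock, not spec behavior; the queue, classes, and the element-wise add op are all invented for the sketch): every write and compute is merely recorded, and the whole "command list" runs when the data is read back.

```javascript
// Mock of lazy execution: operations are recorded into a queue and only
// flushed -- in one go, like a single command list -- on read-back.
class LazyQueue {
  constructor() { this.ops = []; }
  record(op) { this.ops.push(op); }
  flush() { this.ops.splice(0).forEach((op) => op()); }
}

class MockTensor {
  constructor(queue) { this.queue = queue; this.data = null; }
  writeFrom(buffer) {
    const snapshot = Float32Array.from(buffer);          // copy now...
    this.queue.record(() => { this.data = snapshot; });  // ..."upload" later
  }
  async readInto(buffer) {
    this.queue.flush();  // run uploads, dispatches, and download together
    buffer.set(this.data);
  }
}

// compute() records a stand-in element-wise add instead of running it now.
function compute(queue, { input1, input2 }, { output }) {
  queue.record(() => {
    output.data = input1.data.map((v, i) => v + input2.data[i]);
  });
}

const queue = new LazyQueue();
const t1 = new MockTensor(queue);
const t2 = new MockTensor(queue);
const t3 = new MockTensor(queue);
t1.writeFrom([1, 2]);
t2.writeFrom([3, 4]);
compute(queue, { input1: t1, input2: t2 }, { output: t3 });
// Nothing has executed yet; readInto() triggers the batched execution.
const out = new Float32Array(2);
t3.readInto(out).then(() => console.log(out)); // Float32Array [4, 6]
```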
> And lastly, there are some type of tensor data e.g. constant weight data that sometimes requires special treatment depending on the underlying hardware;

The constant data would not be represented by a Tensor. We use `nn.constant` to associate the constant data with a model. For a GPU implementation, the special treatment, including an additional copy pass, could be handled in `model.compile`.
Does this work?
> I think we might not need to maintain the one to one mapping between a Tensor and a GPU resource. A tensor could just "record" the logical resource usage, such as input or output of a compute.

In practice the backing implementation of the API Tensor type would most likely be just a sort of handle with a refcounted pointer to the context, which complicates the implementation since there will need to be additional bookkeeping going on internally. It'll also tax resource management and garbage collection with small handles moving around. The API object gives a false impression that it can do something useful.
> When running above code, a GPU implementation might be able to delay the execution until the last step (await tensor3.readInto(outputBuffer)).

That is very unintuitive and misleading to developers. It is very natural to expect that operations like shader dispatches happen at `compute`. The download copy to a CPU buffer should be entirely optional. I'm hesitant about an API that doesn't lend itself well to intuition. It is very important to convey the intent of the API so people know what to expect.
> The constant data would not be represented by Tensor. We use nn.constant to associate the constant data to a model.

Understood. I'm not suggesting that we change the notion of `nn.constant`. I'm pointing out that during graph execution the implementation will need to tag the tensors backing constant operands for that extra initialization stage, as many GPU drivers pre-process weight data ahead of execution. By splitting off a notion of an API Tensor, it's one additional thing to keep track of during graph execution.
> I'm pointing out that during graph execution the implementation will need to tag the tensors backing constant operands for that extra initialization stage as many GPU drivers pre-process weight data ahead of execution. By splitting up a notion of API Tensor, it's one additional thing to keep track on during graph execution.

This "pre-processing of weight data" is called packing (#86) and also happens with CPU drivers. To generalize slightly, the tensor need not be a constant but simply be reused enough to justify packing it. For example, a transformer encoder's outputs will be consulted repeatedly by a transformer decoder.

You appear to be assuming that graph compilation is the right place to pack parameter matrices. Is your graph compilation general enough to support graph structures that differ for every input, like syntactic parsing and the existence of the dynet toolkit? https://github.com/clab/dynet/ If not, then graph compilation is not the place to stuff packing.
> You appear to be assuming that graph compilation is the right place to pack parameter matrices. Is your graph compilation general enough to support graph structures that differ for every input like syntactic parsing and the existence of the dynet toolkit?

@kpu WebNN only defines a contract between web applications and the browsers. It doesn't prescribe how a graph compilation step is to be implemented. That will be up to the implementation of this contract in the browser. What I'm saying is:

1. Packed weight formats vary depending on the underlying hardware, and because of that it may not be suitable to surface them in a web API contract. In the case of the GPU, different drivers, sometimes even from the same IHV vendor, use different pre-processed formats as it also depends on the hardware generation; the weight pre-processing step is carried out between the OS and the GPU drivers today.
2. The graph compilation step, as defined in the contract as `model.compile`, is a suitable timing for the browser implementation to mediate weight preprocessing with the OS.
> 1. Packed weight formats vary depending on the underlying hardware and because of that it may not be suitable to surface them in a web API contract. In the case of the GPU, different drivers, sometimes even from the same IHV vendor, use different pre-processed formats as it also depends on the hardware generation; the weight pre-processing step is carried out between the OS and the GPU drivers today.

I am not suggesting the pre-processed format be exposed to the user. I am suggesting the user hint that it should be done, and that the format be opaque to the user.
> 2. The graph compilation step as defined in the contract as `model.compile` is a suitable timing for the browser implementation to mediate weight preprocessing with the OS.

There are models that create a new, different graph for every input. The most basic example of this is that sentences have different lengths. A more extreme example is running a syntactic parser and then using the parse as a graph structure. So does graph compilation support these cases in one compilation? If not, then graph compilation is the wrong place, because weight packing should happen once per weight tensor, not once per graph per weight tensor.
That's a good point. There are models whose behavior depends on input shapes, which may not be known until execution time, e.g. free dimensions, ROI, variable-length RNN, etc. For such models, not every part of the graph can be finalized at the compile step until after all the inputs are bound to it at execution time. But even so, the compile step is still per-model and not per-graph; constant inputs are fixed-shape and therefore can still be pre-processed at compile time.
@wchao1115

> That would be `Compilation.setInput`. I just thought `Compilation` and `Execution` are redundant. It looks like we can make do with just 1 stateful object instead of 2.
I understand there is only 1 stateful object (`Execution`) in today's spec. The `Compilation` has not been stateful since we landed https://github.com/webmachinelearning/webnn/pull/46.
> `interface Compilation { void setInput(DOMString name, ArrayBufferView data); Promise<sequence<ArrayBufferView>> compute(sequence<DOMString>); };`
This `Compilation` proposal would work if we eliminate `Execution`. However, it still needs to add support for input/output with dynamic shapes, which was requested by @pyu10055.

If we keep `Execution`, my proposal for dynamic shape support in `Execution` was raised in https://github.com/webmachinelearning/webnn/issues/87#issuecomment-688612462. I am pasting it here for convenience.
```diff
 interface Execution {
-  void setInput(DOMString name, ArrayBufferView data);
+  void setInput(DOMString name, ArrayBufferView data, optional sequence<long> shape);
   void setOutput(DOMString name, ArrayBufferView buffer);
   Promise<void> startCompute();
+  sequence<long> getOutputShape(DOMString name);
+  Promise<void> readOutput(DOMString name, ArrayBufferView buffer);
 };
```
Another difference is either returning a sequence of `ArrayBufferView`s or writing into pre-allocated `ArrayBufferView`s. The approach of writing into pre-allocated buffers would avoid creating new ArrayBufferViews on each compute. And when passing the output to WebAssembly code, e.g. for post-processing or custom operators, this approach would also avoid a JS-to-WebAssembly memory copy, as discussed in https://github.com/w3c/machine-learning-workshop/issues/93 and https://github.com/WebAssembly/design/issues/1162.
That's a fair point @huningxin. It sounds like we agree that there needs to be a way to handle dynamic input/output shapes, whether or not we collapse `Compilation` and `Execution` into one.
A note on perf, though: dynamic input shapes are bad for performance, since the model defers the input shape until after graph compilation. This makes it hard for the platform to appropriately pipeline the graph at compile time. So yes, there are models that do that today, but it's not good from a performance point of view.
Chai referred to "resource management and garbage collection". I think this is an important point worth fleshing out, regardless of which interfaces we adopt as discussed above.
Specifically, how does memory allocation and deallocation happen and to what extent does the web application have control and responsibilities over this? Is there an accepted standard for this in other existing browser IDL APIs that will suffice and work for us? (That is, since garbage-collection may not be good enough for the underlying buffers.)
A note on perf, though: dynamic input shapes are bad for performance, since the model defers the input shape until after graph compilation. This makes it hard for the platform to appropriately pipeline the graph at compile time. So yes, there are models that do that today, but it's not good from a performance point of view.
Dynamic input shapes are not, in and of themselves, bad for perf. Compilers that run once are bad for the perf of models that exist in the wild. A (theoretical) good compiler runs ahead of time but also memoizes partial compilation so that just-in-time changes can be made in response to dynamic shapes. I would suggest `Compilation` and `Execution` be collapsed, because that will make it easier to abstract and blur ahead-of-time and just-in-time approaches, instead of enforcing ahead-of-time compilation as part of the API.
@kpu JIT is always more flexible as long as its overhead is properly managed. However, there are also important use cases where apps want to pre-compile and reuse the compiled models over and over again in high-bandwidth call paths, e.g. real-time segmentation, super-resolution, etc.; dynamism in the models, e.g. free input dimensions, variable sequences, etc., is troubling for that type of scenario. I think we should design the API so that deferred compilation is possible and pre-compilation is efficient.
Execution implies compilation, right? What is the performance benefit of an explicit compilation step vs. an implicit cached one?
Compilation takes time. An explicit compile step gives apps more control over when to spend that time, which is quite important for end-to-end perf. Given a model that lends itself well to static (i.e. ahead-of-time) compilation, it's much more efficient to compile the model ahead of execution, such as during the app's initialization, page load, or model construction time, while execution may happen far more frequently afterward with the pre-compiled model, e.g. on every visual update.
JIT also tends to carry management overhead from figuring out when and how to evict the cache and re-JIT, with the worst case being that you always have to JIT the model before execution, which could be really inefficient.
It does, however, simplify the API and make it easier to use. You could eliminate the compile step altogether, but it comes at the cost of having less control. In the context of the GPU, it's almost always preferable to pre-compile, because dynamic processes like fusion, layout assignment, and weight packing could break GPU pipelining. It's still doable, but with more caveats and care to prevent bubbles.
So, I think we still need to keep the compile step. We should be able to fold `Compilation` and `Execution` into one, though.
So, I think we still need to keep the compile step.
+1
We should be able to fold `Compilation` and `Execution` into one though.
+1
Additionally, we may consider removing the `Model` interface, as the graph info already exists. With that, the graph compilation code would be simplified to:
const a = nn.input('a', descriptorA);
const b = nn.constant(descriptorB, bufferB);
const c = nn.matmul(a, b);
-const model = await nn.createModel([{name: 'c', operand: c}]);
-const compilation = await model.createCompilation({powerPreference: 'low-power'});
+const compilation = await nn.compile({'c': c}, {powerPreference: 'low-power'});
We may also consider making the folded `Compilation` interface stateless by supplying input buffers to the `compute` method.
With that, for pre-allocated output buffers, the graph execution code would be simplified as:
-const execution = await compilation.createExecution();
-execution.setInput('a', bufferA);
-execution.setOutput('c', bufferC);
-await execution.startCompute();
+await compilation.compute({'a': bufferA}, {'c': bufferC});
The above code should cover the existing use cases of today's spec. The following samples are for new requirements: 1) no pre-allocated output buffers; 2) dynamically shaped inputs/outputs.
If pre-allocated output buffers are not supplied, `compute` would return the output buffers with shape info. For example:
let results = await compilation.compute({'a': bufferA});
console.log(results.c.shape);
console.log(results.c.buffer);
For dynamically shaped inputs, `compute` would allow specifying the input shape. For example:
let results = await compilation.compute({'a': {shape: shapeA, buffer: bufferA}});
console.log(results.c.shape);
console.log(results.c.buffer);
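The reason the caller need not supply the output shape is that the implementation can infer it from the (possibly dynamic) input shapes. As a minimal sketch, here is what that inference looks like for the `matmul` used in the samples above; the helper name is illustrative, and real shape inference would also handle batching, broadcasting, etc.

```javascript
// Minimal output-shape inference for a 2-D matmul: [m, k] x [k, n] -> [m, n].
// Illustrative helper, not a spec API.
function inferMatmulShape(shapeA, shapeB) {
  const [m, kA] = shapeA;
  const [kB, n] = shapeB;
  if (kA !== kB) {
    throw new Error(`incompatible shapes: [${shapeA}] x [${shapeB}]`);
  }
  return [m, n];
}
```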
Because a pre-allocated buffer avoids an extra memory copy to WebAssembly memory, it would be useful to support dynamic output shapes for this usage as well. One option is that `compute` still returns the inferred output shape when a pre-allocated buffer is provided.
let results = await compilation.compute({'a': {shape: shapeA, buffer: bufferA}}, {'c': bufferC});
console.log(results.c.shape);
If the pre-allocated buffer is not big enough, `compute` would reject the promise with an error. The error may contain the inferred output shape, which helps users re-allocate the buffer.
try {
results = await compilation.compute({'a': {shape: shapeA, buffer: bufferA}}, {'c': bufferC});
} catch (err) {
bufferC = new Float32Array(sizeOfShape(err.c.shape));
results = await compilation.compute({'a': {shape: shapeA, buffer: bufferA}}, {'c': bufferC});
}
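The `sizeOfShape` helper used in the sample above is left undefined in this thread; one possible implementation (assuming shapes are arrays of non-negative integers) is simply the product of the dimensions:

```javascript
// One possible implementation of the sizeOfShape helper used above:
// the element count of a tensor is the product of its dimensions.
function sizeOfShape(shape) {
  return shape.reduce((size, dim) => size * dim, 1);
}
```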
Just to make sure I understand. You're talking about something like this?
dictionary Input {
required DOMString name;
required ArrayBufferView buffer;
sequence<long> dimensions; // optional
};
dictionary Output {
required DOMString name;
ArrayBufferView buffer; // optional
sequence<long> dimensions; // optional
};
interface Compilation {
Promise<sequence<Output>> compute(sequence<Input> inputs, optional sequence<Output> outputs);
};
If `outputs` is not specified, return all model outputs with buffers and shapes.
That could work.
@wchao1115
You're talking about something like this?
Semantics-wise, yes. There are some syntax tweaks. I propose using `record<K, V>`, which allows the literal syntax for mapping a name to an input/output.
dictionary Input {
required ArrayBufferView buffer;
sequence<long> dimensions; // optional
};
dictionary Output {
ArrayBufferView buffer; // optional
sequence<long> dimensions; // optional
};
typedef record<DOMString, Input> Inputs;
typedef record<DOMString, Output> Outputs;
interface Compilation {
Promise<Outputs> compute(Inputs inputs, optional Outputs outputs);
};
With that, users could pass inputs as in the following code:
let results = await compilation.compute({'a': {buffer: bufferA, dimensions: shapeA}});
and access an output by named property:
console.log(results.c.dimensions);
console.log(results.c.buffer);
WDYT?
@huningxin @wchao1115 Are we combining compilation and execution into the single method compilation.compute? For users to supply the Input/Output parameters, do they need to know the shape prior to calling the compute method?
@huningxin @wchao1115 Slowly reading through the thread, I think the proposal solves my concerns with the original API.
One extra question is the concept of compilation for the end user. It looks like compilation becomes implicit, since it is combined with execution, and it is not clear to me why we need to introduce this concept. Especially for end users who do not care about controlling compilation, it would be much clearer to have an execute method directly on the Model interface:
model.execution(Inputs inputs, optional Outputs outputs);
@pyu10055
Are we combining the compilation and execution into a single method compilation.compute?
No. They are separated.
The compilation would be done by `nn.compile` as:
const a = nn.input('a', descriptorA);
const b = nn.constant(descriptorB, bufferB);
const c = nn.matmul(a, b);
const compilation = await nn.compile({'c': c}, {powerPreference: 'low-power'});
The execution is done by `compilation.compute` as:
let results = await compilation.compute({'a': {buffer: bufferA, dimensions: shapeA}});
console.log(results.c.dimensions);
console.log(results.c.buffer);
For users to supply the Input/Output parameters, do they need to know the shape prior to calling the compute method?
It would not require users to know the output shape prior to calling the compute method. According to compute's signature, `Promise<Outputs> compute(Inputs inputs, optional Outputs outputs)`, when the `outputs` parameter is not provided, the compute method returns the output buffers and shapes, as the sample code above shows.
For a static input shape, for example:
const a = nn.input({type: 'tensor-float32', dimensions: [2, 2]});
The users can just provide the buffer when calling the compute method, such as
let results = await compilation.compute({'a': {buffer: bufferA}});
For a dynamic input shape, for example:
const a = nn.input({type: 'tensor-float32', dimensions: [-1, 2]});
The users need to provide both the buffer and the shape, such as:
let results = await compilation.compute({'a': {buffer: bufferA, dimensions: [2, 2]}});
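Behind this, the implementation would need to check the concrete shape supplied at compute time against the declared descriptor, where -1 marks a free dimension as in the `[-1, 2]` example above. A minimal sketch of that check (helper name is illustrative):

```javascript
// Validate a concrete shape against a declared shape where -1 means a
// free dimension, as in the [-1, 2] example. Illustrative helper only.
function validateShape(declared, concrete) {
  if (declared.length !== concrete.length) return false;
  return declared.every((dim, i) => dim === -1 || dim === concrete[i]);
}
```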
@huningxin Actually this has gone back a little bit. Take the following graph as an example:
const a = nn.input('a', descriptorA);
const b = nn.constant(descriptorB, bufferB);
const c = nn.matmul(a, b);
const d = nn.matmul(a, c);
const compilation = await nn.compile({'d': d}, {powerPreference: 'low-power'});
The compilation is done for the full graph. Now I want to execute the subgraph that has c as the output; can the previous compilation be used for this new execution?
let results = await compilation.compute({'a': {buffer: bufferA, dimensions: shapeA}}, {'c': {...}});
console.log(results.c.dimensions);
console.log(results.c.buffer);
Or does the user need to create a new compilation?
const compilation2 = await nn.compile({'c': c}, {powerPreference: 'low-power'});
@pyu10055
Or does the user need to create a new compilation?
Yes. The users could create a new compilation for that subgraph.
I am thinking about an alternative that allows specifying multiple outputs at compile time, such as:
const compilation = await nn.compile({'c': c, 'd': d}, {powerPreference: 'low-power'});
The users may just compute either c or d.
let results = await compilation.compute({'a': {buffer: bufferA, dimensions: shapeA}}, {'c': {}});
console.log(results.c.dimensions);
console.log(results.c.buffer);
results = await compilation.compute({'a': {buffer: bufferA, dimensions: shapeA}}, {'d': {}});
console.log(results.d.dimensions);
console.log(results.d.buffer);
Would this work?
@huningxin This would work, but from a usability point of view, why do users need to care about the compilation step? Can compilation be an internal implementation detail that is hidden from end users? The reason is that I don't see the benefit of calling the compile + compute APIs compared to a single API:
model.execution(Inputs inputs, optional Outputs outputs, {powerPreference: 'low-power'});
@pyu10055
why do users need to care about the compilation step?
I think the major reason is that compilation takes time and users may want to control when that happens. @wchao1115 shared more insights at https://github.com/webmachinelearning/webnn/issues/87#issuecomment-693699105
One example: with webnn-polyfill, the compilation time and first-inference time of the LeNet example are:
compilation elapsed time: 582.70 ms
execution elapsed time: 27.80 ms
As we discussed, since there is no dedicated compilation step in tf.js, the webnn-polyfill emulates compilation by letting tf.js infer once. The first inference compiles the WebGL kernels, so it takes much longer than the following inferences. I guess the compilation in a native API would be similar.
This would work, but from a usability point of view, why do users need to care about the compilation step?
It comes down to the choice between ease of use and control. A separate compile step gives API users more control to decide when would be an ideal time to spend on model compilation. For the scenarios described earlier in the thread, it is important to keep it.
@huningxin I'm not familiar with `record<K,V>`. Is it supported in the W3C WebIDL spec? Can you point me to the standard definition? I'm not sure what the `c` in the expression `results.c.dimensions` stands for.
There are a couple of questions on the existing execution API:
1. The current model execution API requires users to provide output buffers before execution. This is not very convenient, since it is an extra step and users might not know the shape of the output beforehand. Also, for many models the output shape depends on the input shape, so it is an extra burden for users to find that out.
2. The current execution is built on the compilation of the full graph. While the execution API does not prevent users from executing a sub-graph of the model, it is not clear why the pre-compilation is needed and whether it should be internal to the execution, so it can take care of sub-graph execution.