webmachinelearning / webnn

🧠 Web Neural Network API
https://www.w3.org/TR/webnn/

Model Execution API #87

Closed pyu10055 closed 4 years ago

pyu10055 commented 4 years ago

There are a couple of questions about the existing execution API:

  1. The current model execution API requires users to provide output buffers before execution. This is not very convenient: it is an extra step for the user, and the user might not know the shape of the output beforehand. Also, for many models the output shape depends on the input shape, so it is an extra burden for users to figure it out.

  2. The current execution is built on a compilation of the full graph. While the execution API does not prevent users from executing a sub-graph of the model, it is not clear why the pre-compilation is needed, and whether it should be internal to the execution so that it can take care of sub-graph execution.

huningxin commented 4 years ago
  1. this is not very convenient since this is an extra step for the user and the user might not know the shape of the output beforehand.

Good catch! One possible solution is to allow user code to query the output shape, as Android NNAPI does (@miaowang14).

2. The current execution is built on a compilation of the full graph. While the execution API does not prevent users from executing a sub-graph of the model,

I don't quite get this issue. Could you please elaborate on the case where the API compiles the full graph while executing only a sub-graph?

gramalingam commented 4 years ago

But the output shape may not always be statically known. It would be good if the API allows for the possibility of the output buffer being allocated during execution when its size is known. Can we treat the call to "setOutput" as optional (not required)?

huningxin commented 4 years ago

But the output shape may not always be statically known. It would be good if the API allows for the possibility of the output buffer being allocated during execution when its size is known.

+1. The API may also need to allow specifying the input shape, which is not supported currently.

Can we treat the call to "setOutput" as optional (not required)?

Then how would we get the execution results? Could you please elaborate on the idea?

gramalingam commented 4 years ago

The execution API essentially needs to allow a way to specify the inputs, optionally specify output buffers, and get back the outputs. I think various possibilities exist for this. In the long run, it may actually be useful to support multiple signatures for different ways of doing this (mostly as a syntactic convenience).

One option would be to have a return result (for example, a dictionary mapping output-name to its value).
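
A minimal sketch of that idea, assuming startCompute were changed to resolve with such a dictionary (illustrative only, not the current spec):

const outputs = await execution.startCompute();
// 'outputs' would map each output name to its computed value,
// e.g. a typed array shaped according to the inferred output shape.
console.log(outputs['output']);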

wchao1115 commented 4 years ago

One option would be to have a return result (for example, a dictionary mapping output-name to its value).

+1.

I'm also looking at a few related Operand type issues. Maybe we can make one sweeping change.

  1. The Operand type should be limited to tensor types i.e. only tensor-float32 type is allowed and not float32. Non-tensor arrays should use sequences, lists, or TypedArray for clarity since a tensor can also be a GPU resource.
  2. Quantized tensor as a type should be avoided as it forces all the defined operators to support quantized type, which isn't realistic (#84). It was previously agreed that quantized types should be factored out (#44).
  3. Operand should support sync and async download of tensor data to a TypedArray. This should address the first issue above.

huningxin commented 4 years ago

The execution API essentially needs to allow a way to specify the inputs, optionally specify output buffers, and get back the outputs.

+1.

Relative to the current Execution API, the missing functionalities are specifying the input shape and getting the output shape and data. So, semantically, we may extend the setInput method with an optional shape parameter and add getOutputShape and readOutput methods, such as:

 interface Execution {
-  void setInput(DOMString name, ArrayBufferView data);
+  void setInput(DOMString name, ArrayBufferView data, optional sequence<long> shape);
   void setOutput(DOMString name, ArrayBufferView data);
   Promise<void> startCompute();
+  sequence<long> getOutputShape(DOMString name);
+  Promise<void> readOutput(DOMString name, ArrayBufferView buffer);
 };

With that, the sample code for dynamic input/output shape would be

// Create an Execution object for the compiled model.
const execution = compilation.createExecution();

const inputShape = [2, 2, 2, 2];
// Setup the input buffers with value 1.
const inputBuffer1 = new Float32Array(sizeOfShape(inputShape)).fill(1);
const inputBuffer2 = new Float32Array(sizeOfShape(inputShape)).fill(1);

// Specify the input data and its shape.
execution.setInput('input1', inputBuffer1, inputShape);
execution.setInput('input2', inputBuffer2, inputShape);

// Start the asynchronous computation.
await execution.startCompute();

// Get the output shape and allocate the buffer.
const outputShape = execution.getOutputShape('output');
const outputBuffer = new Float32Array(sizeOfShape(outputShape));

// Read the output values into the allocated buffer.
await execution.readOutput('output', outputBuffer);

// The computed result is now in outputBuffer.
console.log(outputBuffer);

One option would be to have a return result (for example, a dictionary mapping output-name to its value).

This could be a syntactic improvement. One difference is that the above proposal allows getting the value of a single output instead of returning them all.

huningxin commented 4 years ago

I'm also looking at a few related Operand type issues.

@wchao1115 , I agree there is room to improve Operand. However, Operand is only used for graph building in the current API. Do you propose to use it also for graph execution?

wchao1115 commented 4 years ago

@huningxin I understand that Operand as it is defined now acts more like a graph edge and not necessarily a data type like tensor. But in practice the two concepts are inseparable; it would be much simpler for a graph API to think of them as one. Both TensorFlow and PyTorch handle it this way.

Concretely, a few things stand out for me:

  1. The operator API already accepts both attribute and tensor arguments, with the former specified as native data types e.g. float, bool, etc. Therefore there is no point in also defining an operand of a float32 data type as a separate notion from an operand of a tensor-float32 data type. An operand with float32 type should just mean a graph edge that would carry a tensor of type float32.
  2. Although requiring a CPU buffer view to construct an input operand may make sense, an output operand could be backed by a GPU tensor that is constructed on the fly by the graph execution process. So, requiring the app to always download the output tensor back to a CPU buffer view doesn't make sense, since it may want to pass the GPU resource along to other APIs that may accept it, e.g. with ML models that emit output images, etc. A graph execution may also produce multiple outputs, and the app may choose to ignore some of them.
  3. Quantized tensor further clouds the Operand type with more procedural data that doesn't really contribute to the identity of the operand type (as a graph edge). As previously pointed out, it also unnecessarily raises the requirement that all the operator APIs must support quantized types; a requirement that actually cannot be fulfilled in all cases.

A leaner design in my mind would be one where an operand exposes a method for the app to optionally download its content to an ArrayBuffer, only if they want to. This call would fail if it's made before the graph is executed, since the output tensor won't be known then. With that API addition, it turns out that execution.setOutput becomes entirely unnecessary, since a model would already have been created with the named dictionary of output operands (see nn.createModel), and the ArrayBuffer of an output operand would be created by the execution, not by the construction of the graph.
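
A rough sketch of this leaner shape, assuming a hypothetical data() method on the output operand for the optional download (the method name is illustrative, not proposed spec text):

const model = await nn.createModel([{name: 'output', operand: output}]);
const compilation = await model.createCompilation({powerPreference: 'high-performance'});
const execution = compilation.createExecution();
execution.setInput('input', inputBuffer);
await execution.startCompute();
// Optional download; would reject if called before the graph is executed.
const outputBuffer = await output.data();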

kpu commented 4 years ago
2. Quantized tensor as a type should be avoided as it forces all the defined operators to support quantized type, which isn't realistic (#84). It was previously agreed that quantized types should be factored out (#44).

In #17 it was "RESOLVED: The specification will reference a subset of the ONNX operations". The ONNX operations include type constraints, e.g. https://github.com/onnx/onnx/blob/master/docs/Operators.md#MatMulInteger . Can we just follow them?

huningxin commented 4 years ago

So, requiring the app to always download the output tensor back to the CPU buffer view doesn't make sense, since it may want to pass the GPU resource along to other API that may accept it e.g. with ML models that emit output images, etc.

That's a good point. Besides passing the GPU-backed output tensor to other APIs, I think the Execution API should be able to accept the output tensor as an input. That would enable pipelining multiple graph executions without moving data across devices.

A leaner design in my mind would be one with an operand exposing a method for the app to optionally download its content to an ArrayBuffer only if they want to. This call would fail if it's made before the graph is executed since the output tensor won't be known then.

My understanding is that the input and output tensors of the Execution API should be device-dependent, e.g. GPU resources. However, Operand is device-independent; for example, nn.input is just a symbolic operand that never has values, and nn.constant doesn't know on which device to store its values until compilation.

So I am thinking about adding a Tensor interface that is backed by device memory. The Execution would accept input Tensors and output Tensors. The Tensor interface would have methods for async shape querying and data downloading. As a Compilation has a target device, we may allow creating a Tensor from a Compilation. A code sketch would be:

// Create an Execution object for the compiled model.
const execution = compilation.createExecution();

const inputShape = [2, 2, 2, 2];
// Setup the input buffers with value 1.
const inputBuffer1 = new Float32Array(sizeOfShape(inputShape)).fill(1);
const inputBuffer2 = new Float32Array(sizeOfShape(inputShape)).fill(1);

// Create the input and output tensors for compilation device.
const inputTensor1 = compilation.device.tensor(float32TensorType, inputBuffer1);
const inputTensor2 = compilation.device.tensor(float32TensorType, inputBuffer2);
const outputTensor = compilation.device.tensor();

// Set input and output tensors to an execution.
execution.setInput('input1', inputTensor1);
execution.setInput('input2', inputTensor2);
execution.setOutput('output', outputTensor);

// Start the computation; it is no longer async, as reading the tensor back is async.
execution.startCompute();

// Query the output shape, await for shape inference done.
const outputShape = await outputTensor.getDimensions();
const outputBuffer = new Float32Array(sizeOfShape(outputShape));

// Read the data back, await for inference done.
await outputTensor.readInto(outputBuffer);
console.log(outputBuffer);

The execution.setOutput may still be useful, as it would allow reusing a Tensor from one execution to another.
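
For instance, under the Tensor sketch above, the device-resident output of one compiled graph could be fed directly as an input of a second one (compilation2 is assumed to be a second compiled graph whose input matches the first graph's output):

const execution2 = compilation2.createExecution();
// Reuse the device-resident output of the first execution as an input.
execution2.setInput('input', outputTensor);
const finalTensor = compilation2.device.tensor();
execution2.setOutput('output', finalTensor);
execution2.startCompute();

// Read the final result back only at the end of the pipeline.
const finalShape = await finalTensor.getDimensions();
const finalBuffer = new Float32Array(sizeOfShape(finalShape));
await finalTensor.readInto(finalBuffer);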

gramalingam commented 4 years ago

Hi @wchao1115 : one clarification about the attributes and operands: I think that the primary distinction between attributes and operands is that attributes represent values that are known statically, at the time the graph/model is built; on the other hand, operands represent values that typically will not be known statically (though, in some special cases they might be).

So, one key question we should address is how we want to handle scalar values that are known only at runtime (and not statically). Many frameworks do use tensors of rank 0 to represent such scalars. We could do the same. In this case, restricting operand types to be tensor-float32 and not float32 would work. So, this is doable, though it may incur a small performance penalty. But since these scalars are not going to be the performance bottleneck, it should be okay.
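
For example, a runtime scalar could be modeled as a rank-0 tensor input, reusing the setInput shape from earlier in the thread (assuming an empty dimensions list denotes rank 0; this is a sketch, not settled syntax):

// A scalar known only at runtime, modeled as a rank-0 tensor input.
const threshold = nn.input('threshold', {type: 'tensor-float32', dimensions: []});
// At execution time the caller binds a one-element buffer to it.
execution.setInput('threshold', new Float32Array([0.5]));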

wchao1115 commented 4 years ago

@gramalingam That's a good point. We haven't fully defined the notion of tensor scalar. From the graph execution standpoint, tensors are device-specific resources, either as inputs or outputs of the process.

wchao1115 commented 4 years ago

@huningxin I think this is too complicated since the developers now need to deal with 3 different notions -- operand, tensor, and buffer. Besides, the tensors are now created by compilation, which raises more questions than it answers.

Even though tensors are device-dependent, it doesn't mean that we have to surface it as a type to the users. I think the current notion of nn.input is perfectly fine. If in the future we need an ability to construct an operand out of other device resources e.g. video frame, there is nothing stopping us from defining a nn.inputFromVideoFrame method. We don't need to expose the notion of device-specific resource as a type to the users; we just need to give them a way to work with it. TF.js also does it this way. Their so-called Tensor type in spite of its name is still a conceptual type and not a device-dependent resource type.

Same for the output operand, as long as there is a way to retrieve the computed content in it and put it in a CPU buffer or in a future resource type, there is no need to expose the notion of the device-specific resource type to the users.

Also, if you look at how nn.createModel is defined, we already require through this API that the users specify the dictionary of all output operands, even before compilation. So all that we need is to expose a way to get the content out of it and we won't even need setOutput anymore. We also don't need to pass the output operands to startCompute, as it's already defined in the model.

huningxin commented 4 years ago

Also, if you look at how nn.createModel is defined, we already require through this API that the users specify the dictionary of all output operands, even before compilation. So all that we need is to expose a way to get the content out of it and we won't even need setOutput anymore. We also don't need to pass the output operands to startCompute, as it's already defined in the model.

Users may compile the same Model to multiple Compilations with different CompilationOptions, targeting different underlying devices. In this case, there is only one output Operand associated with the Model. This Operand might not be able to represent the different device resources. For example:

const model = await nn.createModel([{name: 'output', operand: output}]);

// e.g. target iGPU
const compilation1 = await model.createCompilation({powerPreference: 'low-power'});
// e.g. target dGPU
const compilation2 = await model.createCompilation({powerPreference: 'high-performance'});

// Create Execution objects for the compiled models.
const execution1 = await compilation1.createExecution();
const execution2 = await compilation2.createExecution();

execution1.startCompute();
execution2.startCompute();

// Which device to read back, iGPU or dGPU?
await output.readInto(buffer);

gramalingam commented 4 years ago

Hi @huningxin : to read the output value into a buffer, we would need to invoke a "getValue(operand)" method on either execution or on a result returned by execution.startCompute(), rather than an operand.

@wchao1115 : I assume that the existing buffer abstractions suffice to represent memory on different devices? Assuming that, I think the primary distinction between a "Tensor" and a "buffer" would be that a "Tensor" attaches a "shape" to a buffer. I think that is a useful addition ... in most frameworks there is such a distinction. I assume the existing Buffer abstraction does not have a shape associated with it?
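
A plain-object sketch of that distinction (illustrative only):

// An ArrayBufferView alone carries no shape information.
const buffer = new Float32Array(2 * 3);
// A 'tensor' in this sense simply attaches dimensions to the buffer.
const tensor = {buffer: buffer, dimensions: [2, 3]};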

huningxin commented 4 years ago

to read the output value into a buffer, we would need to invoke a "getValue(operand)" method on either execution

This might not allow passing the GPU-backed output to other API or a following execution without moving data between GPU and CPU.

or on a result returned by execution.startCompute(),

This would allow above use case. However, as execution is for one-shot inference, this might not allow reusing the device memory for multiple executions/inferences.

rather than an operand.

I agree that we should not read the value from an operand.

I assume that the existing buffer abstractions suffice to represent memory on different devices?

If you mean ArrayBufferView, I suppose it only represents CPU memory. WebGPU is defining GPUBuffer for GPU memory.

I assume the existing Buffer abstraction does not have a shape associated with it?

Neither ArrayBufferView nor GPUBuffer has a shape.

wchao1115 commented 4 years ago

Can this work? (I don't recall why we would need Execution in addition to Compilation; I also simplified the method names for readability.)


// model produces 2 outputs
const model = await nn.createModel([{name: 'output1', operand: output1}, {name: 'output2', operand: output2}]);
const compilation = await model.compile();
const buffers = await compilation.compute(['output1']);
// 'buffers' is sequence<ArrayBufferView> 
// 'buffers[0]' holds a filled CPU buffer of just the content of output1

huningxin commented 4 years ago

I don't recall why we would need execution in addition to compilation

The previous discussion is at https://github.com/webmachinelearning/webnn/issues/39#issuecomment-589963045. Probably we need to revisit that.

also simplify the method names for readability

+1

Can this work?

Two open questions:

  1. The inputs are not set for the compute. Is that missing?
  2. If the output is on the GPU, returning sequence<ArrayBufferView> would download the data to CPU memory. It might not be able to support the use case @wchao1115 mentioned in https://github.com/webmachinelearning/webnn/issues/87#issuecomment-688623748

    it may want to pass the GPU resource along to other API that may accept it e.g. with ML models that emit output images, etc.

gramalingam commented 4 years ago

I prefer a single "compilation.compute(...)" method (as suggested by Chai).

Re. Ningxin's point 1: yes: we need to supply the input-values as parameters to the "compute" method.

Re. Ningxin's point 2: I think if we want to support this use case, we have two options. (a) The caller specifies (using parameters to the "compute" method) whether they want the output copied to the CPU or not. (b) We don't do any copying in "compute" and the caller copies the results to the CPU separately (if they want to).

I prefer option (b). So, basically, generalize Chai's suggestion to:

const outputs = await compilation.compute(input_values, desired_output_names);

where outputs is a sequence of output_type, and output_type is a union that includes tensors in GPU as well as CPU etc.

I think it is useful to have a type representing a tensor, regardless of where it is located.
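
A hypothetical usage of that generalized form (the names and the exact output representation are illustrative):

const outputs = await compilation.compute(
    {'input1': inputBuffer1, 'input2': inputBuffer2},  // input values
    ['output1']);                                      // desired output names
// Per option (b), outputs[0] could be a CPU buffer or a GPU-backed tensor;
// the caller copies it to the CPU separately only if needed.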

wchao1115 commented 4 years ago

The inputs are not set for a compute. Is it missed?

I omitted the nn.input part in my sample code above for brevity. The input side is the same as before. If we want to accept an external resource for an input operand, we could add new methods e.g. nn.inputFromXXX in the future.

we need to supply the input-values as parameters to the "compute" method.

Not sure why. I don't think we need to pass the input operands to compute as it's already known and labeled in the graph.

it may want to pass the GPU resource along to other API that may accept it e.g. with ML models that emit output images, etc.

I looked at both canvas-2d and canvas-WebGL to understand how to move the output tensor data to either endpoint, and it looks like both paths do start with an ArrayBufferView, which unfortunately would require a copy. But in most known scenarios the bottlenecks are on how to get the data into the model, so requiring copy-out may not be too bad. Future resource sharing with WebGPU may be a possibility, but that would probably need to happen at the context level to ensure cross-adapter resource sharing. I think assuming selective CPU downloads on outputs is probably a safe bet.
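
As an illustration of that copy-out path, an image-like output could be pushed to a 2D canvas roughly like this (a sketch assuming the model emits width × height × 4 RGBA values in [0, 1]; width, height, and outputBuffer are placeholders):

const pixels = new Uint8ClampedArray(width * height * 4);
for (let i = 0; i < pixels.length; ++i) {
  pixels[i] = Math.round(outputBuffer[i] * 255);  // float [0, 1] -> byte
}
const ctx = document.querySelector('canvas').getContext('2d');
ctx.putImageData(new ImageData(pixels, width, height), 0, 0);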

gramalingam commented 4 years ago

Not sure why. I don't think we need to pass the input operands to compute as it's already known and labeled in the graph.

We know what the inputs (that is, the input variables) of the graph are when we build the graph, but not the values of these inputs, which are determined only when we call "compute" and will vary from one call of "compute" to another.

wchao1115 commented 4 years ago

@gramalingam the input values are known, right? That's why it's the input.

(copied from the spec sample)


// Setup the input buffers with value 1.
const inputBuffer1 = new Float32Array(TENSOR_SIZE).fill(1);
const inputBuffer2 = new Float32Array(TENSOR_SIZE).fill(1);

// Associate the input buffers to model’s inputs.
execution.setInput('input1', inputBuffer1);
execution.setInput('input2', inputBuffer2);

gramalingam commented 4 years ago

Yes, we can specify the input values using execution.setInput(...) . But if we want to eliminate execution and directly call compute from a compilation, we need some mechanism to achieve the same goal as execution.setInput.

wchao1115 commented 4 years ago

Yes, we can specify the input values using execution.setInput(...) . But if we want to eliminate execution and directly call compute from a compilation, we need some mechanism to achieve the same goal as execution.setInput.

That would be Compilation.setInput. I just thought Compilation and Execution are redundant. It looks like we can make do with just 1 stateful object instead of 2.


interface Compilation {
  void setInput(DOMString name, ArrayBufferView data);
  Promise<sequence<ArrayBufferView>> compute(sequence<DOMString>);
};
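
Usage of that single-object shape would then look roughly like the following (illustrative only):

compilation.setInput('input1', inputBuffer1);
compilation.setInput('input2', inputBuffer2);
// Request only the outputs we care about; others are skipped.
const buffers = await compilation.compute(['output']);
console.log(buffers[0]);  // CPU buffer holding the content of 'output'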

huningxin commented 4 years ago

I prefer a single "compilation.compute(...)" method (as suggested by Chai).

+1.

Re. Ningxin's point 1: yes: we need to supply the input-values as parameters to the "compute" method.

+1. This would make the Compilation stateless.

(b) We don't do any copying in "compute" and the caller copies the results to the CPU separately (if they want to).

+1. This would enable passing the GPU resource along to other APIs even after the compute method.

I think it is useful to have a type representing a tensor, regardless of where it is located.

+1. I propose to add a Tensor interface and have Compilation.compute operate on tensors instead of ArrayBufferView.

There are some usages with code sketches (updated).

wchao1115 commented 4 years ago

@huningxin It seems a key difference in this discussion so far is whether we should define a notion of Tensor in the API as a dedicated resource that can be implemented and passed in and out of the API, potentially enabling resource sharing with other APIs.

This idea is conceptually sound with slightly greater complexity, but my concern is on the implementation side, especially for the GPU implementation of the API. The first problem is that you actually don't want to map a Tensor to a GPU resource like a UAV buffer, one to one. In practice it's far more common to pool and reuse a single UAV buffer for multiple sets of tensor data with proper alignments. Frequent creation of GPU resources is prohibitively expensive.

Secondly, activities such as upload/download of tensor data from/to CPU buffers are commonly pipelined in the same command list along with shader dispatches that carry out the computation of graph operations, so that the entire command list can be pipelined and executed in one go. It would be highly inefficient to break up the command list just to upload the tensor data for instance.

And lastly, there are some types of tensor data, e.g. constant weight data, that sometimes require special treatment depending on the underlying hardware; a process that involves an additional copy pass before issuing shader dispatches. It's hard to properly locate the timing of that copy pass when the tensor data is associated with Tensor API objects that can operate independently from the rest of the graph execution process.

Based on this set of difficulties, I'm still advocating against defining a Tensor type at the API level and instead continuing with the current approach of explicit upload/download methods off the Compilation interface.

huningxin commented 4 years ago

@wchao1115 , thanks for sharing your thoughts on the GPU implementation; that is extremely helpful.

The first problem is that you actually don't want to map a Tensor to a GPU resource like a UAV buffer, one to one.

+1. I think we might not need to maintain a one-to-one mapping between a Tensor and a GPU resource. A tensor could just "record" the logical resource usage, such as being the input or output of a compute. The real resource allocation is up to the implementation.

Secondly, activities such as upload/download of tensor data from/to CPU buffers are commonly pipelined in the same command list along with shader dispatches that carry out the computation of graph operations

That's a good point.

To test my understanding, I would slightly change the sketch code of the inference loop as:

tensor1.writeFrom(inputBuffer1);
tensor2.writeFrom(inputBuffer2);
compilation.compute({'input1': tensor1, 'input2': tensor2}, {'output': tensor3});
await tensor3.readInto(outputBuffer);

When running the above code, a GPU implementation might be able to delay the execution until the last step (await tensor3.readInto(outputBuffer)). When the user wants to read back the data, the GPU implementation could pipeline the CPU memory upload for tensor1 and tensor2, the shader dispatches of the compiled operations, and the CPU memory download from tensor3 into one command list, and execute it in one go.

And lastly, there are some type of tensor data e.g. constant weight data that sometimes requires special treatment depending on the underlying hardware;

The constant data would not be represented by Tensor. We use nn.constant to associate the constant data with a model. For a GPU implementation, the special treatment, including an additional copy pass, could be handled in model.compile.

Does this work?

wchao1115 commented 4 years ago

I think we might not need to maintain the one to one mapping between a Tensor and a GPU resource. A tensor could just "record" the logical resource usage, such as input or output of a compute.

In practice the backing implementation of the API Tensor type would most likely be just a sort of handle with a refcounted pointer to the context, which complicates the implementation since there will need to be additional bookkeeping going on internally. It'll also tax resource management and garbage collection with small handles moving around. The API object gives a false impression that it can do something useful.

When running above code, a GPU implementation might be able to delay the execution until the last step (await tensor3.readInto(outputBuffer)).

That is very unintuitive and misleading to developers. It is very natural to expect that operations like shader dispatches happen at compute. The download copy to a CPU buffer should be entirely optional. I'm hesitant about an API that doesn't lend itself well to intuition. It is very important to convey the intent of the API so people know what to expect.

The constant data would not be represented by Tensor. We use nn.constant to associate the constant data to a model.

Understood. I'm not suggesting that we change the notion of nn.constant. I'm pointing out that during graph execution the implementation will need to tag the tensors backing constant operands for that extra initialization stage, as many GPU drivers pre-process weight data ahead of execution. By splitting out the notion of an API Tensor, it's one additional thing to keep track of during graph execution.

kpu commented 4 years ago

I'm pointing out that during graph execution the implementation will need to tag the tensors backing constant operands for that extra initialization stage, as many GPU drivers pre-process weight data ahead of execution. By splitting out the notion of an API Tensor, it's one additional thing to keep track of during graph execution.

This "pre-process weight data" is called packing #86 and also happens with CPU drivers. To generalize slightly, the tensor need not be a constant but simply reused enough to justify packing it. For example, a transformer encoder's outputs will be consulted repeatedly by a transformer decoder.

You appear to be assuming that graph compilation is the right place to pack parameter matrices. Is your graph compilation general enough to support graph structures that differ for every input, like syntactic parsing and the existence of the DyNet toolkit (https://github.com/clab/dynet/)? If not, then graph compilation is not the place to put packing.

wchao1115 commented 4 years ago

You appear to be assuming that graph compilation is the right place to pack parameter matrices. Is your graph compilation general enough to support graph structures that differ for every input like syntactic parsing and the existence of the dynet toolkit?

@kpu WebNN only defines a contract between web applications and the browsers. It doesn't prescribe how a graph compilation step is to be implemented. That will be up to the implementation of this contract in the browser. What I'm saying is:

  1. Packed weight formats vary depending on the underlying hardware, and because of that it may not be suitable to surface them in a web API contract. In the case of the GPU, different drivers, sometimes even from the same IHV, use different pre-processed formats, as they also depend on the hardware generation; the weight pre-processing step is carried out between the OS and the GPU drivers today.

  2. The graph compilation step, defined in the contract as model.compile, is a suitable time for the browser implementation to mediate weight preprocessing with the OS.

kpu commented 4 years ago
1. Packed weight formats vary depending on the underlying hardware, and because of that it may not be suitable to surface them in a web API contract. In the case of the GPU, different drivers, sometimes even from the same IHV, use different pre-processed formats, as they also depend on the hardware generation; the weight pre-processing step is carried out between the OS and the GPU drivers today.

I am not suggesting the pre-processed format be exposed to the user. I am suggesting the user hint that it should be done and the format be opaque to the user.

2. The graph compilation step, defined in the contract as `model.compile`, is a suitable time for the browser implementation to mediate weight preprocessing with the OS.

There are models that create a new, different graph for every input. The most basic example is that sentences have different lengths. A more extreme example is running a syntactic parser and then using the parse as the graph structure. So does graph compilation support these cases in one compilation? If not, then graph compilation is the wrong place, because weight packing should happen once per weight tensor, not once per graph per weight tensor.

wchao1115 commented 4 years ago

That's a good point. There are models whose behavior depends on input shapes that may not be known until execution time, e.g. free dimensions, ROI, variable-length RNNs, etc. For such models, not every part of the graph can be finalized at the compile step until all the inputs are bound to it at execution time. But even so, the compile step is still per-model and not per-graph; constant inputs are fixed-shape and can therefore still be pre-processed at compile time.

huningxin commented 4 years ago

@wchao1115

That would be Compilation.setInput. I just thought Compilation and Execution are redundant. It looks like we can make do with just 1 stateful object instead of 2.

I understand there is only 1 stateful object (Execution) in today's spec. The Compilation has not been stateful since we landed https://github.com/webmachinelearning/webnn/pull/46.

interface Compilation {
  void setInput(DOMString name, ArrayBufferView data);
  Promise<sequence<ArrayBufferView>> compute(sequence<DOMString>);
};

This Compilation proposal would work if we eliminate Execution. However, it still needs to add support for inputs/outputs with dynamic shapes, as requested by @pyu10055.

If we keep Execution, my proposal for dynamic shape support in Execution was raised in https://github.com/webmachinelearning/webnn/issues/87#issuecomment-688612462. I am pasting it here for convenience.

 interface Execution {
-  void setInput(DOMString name, ArrayBufferView data);
+  void setInput(DOMString name, ArrayBufferView data, optional sequence<long> shape);
   void setOutput(DOMString name, ArrayBufferView buffer);
   Promise<void> startCompute();
+  sequence<long> getOutputShape(DOMString name);
+  Promise<void> readOutput(DOMString name, ArrayBufferView buffer);
 };

Another difference is whether to return a sequence of ArrayBufferViews or to write into pre-allocated ArrayBufferViews. The approach of writing into a pre-allocated buffer would avoid creating new ArrayBufferViews on each compute. And when passing the output to WebAssembly code, e.g. for post-processing or custom operators, this approach would also avoid a JS-to-WebAssembly memory copy, as discussed in https://github.com/w3c/machine-learning-workshop/issues/93 and https://github.com/WebAssembly/design/issues/1162.
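
A sketch of that WebAssembly case, pre-allocating the output as a view over wasm linear memory so the result lands where the wasm code can read it in place (the memory, offset, and element count are placeholders; in practice they would come from the wasm module and its allocator):

const wasmMemory = new WebAssembly.Memory({initial: 16});
const byteOffset = 0;        // placeholder; normally returned by the wasm allocator
const elementCount = 1000;   // placeholder; the known output element count
const outputView = new Float32Array(wasmMemory.buffer, byteOffset, elementCount);

execution.setOutput('output', outputView);
await execution.startCompute();
// WebAssembly post-processing code can now read the result in place,
// without an extra JS-to-wasm copy.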

wchao1115 commented 4 years ago

That's a fair point @huningxin. It sounds like we agree that there needs to be a way to handle dynamic input/output shapes, whether or not we collapse Compilation and Execution into one.

A note on perf though: dynamic input shapes are bad for perf, since the model defers the input shape until after graph compilation. This makes it hard for the platform to appropriately pipeline the graph at compile time. So, yes, there are models that do that today, but it's not good from a performance point of view.

gramalingam commented 4 years ago

Chai referred to "resource management and garbage collection". I think this is an important point worth fleshing out, regardless of which interfaces we adopt as discussed above.

Specifically, how do memory allocation and deallocation happen, and to what extent does the web application have control and responsibility over this? Is there an accepted standard for this in other existing browser IDL APIs that will suffice and work for us? (That is, since garbage collection may not be good enough for the underlying buffers.)

kpu commented 4 years ago

A note on perf though, dynamic input shape is bad for perf since the model defers the input shape until after graph compilation. This makes it hard for the platform to appropriately pipeline the graph at compile time. So, yes there are models that do that today, but it's not good from the performance point of view.

Dynamic input shape is not, in and of itself, bad for perf. Compilers that run once are bad for perf on models that exist in the wild. A (theoretical) good compiler runs ahead of time but also memoizes partial compilation so that just-in-time changes can be made in response to dynamic shapes. I would suggest Compilation and Execution be collapsed because it will make abstracting and blurring ahead-of-time and just-in-time approaches easier, instead of enforcing ahead-of-time compilation as part of the API.

wchao1115 commented 4 years ago

@kpu JIT is always more flexible as long as its overhead is properly managed. However, there are also important use cases where apps want to pre-compile and reuse the compiled models over and over again in high-bandwidth call paths, e.g. real-time segmentation, super-resolution, etc.; dynamism in the models, e.g. free input dimensions, variable sequences, etc., is troubling for that type of scenario. I think we should design the API so that deferred compilation is possible and pre-compilation is efficient.

kpu commented 4 years ago

Execution implies compilation, right? What is the performance benefit of an explicit compilation step vs. an implicit cached one?

wchao1115 commented 4 years ago

Compilation takes time. An explicit compile step gives apps more control over when to spend that time, which is quite important for end-to-end perf. Given a model that lends itself well to static compilation (i.e. ahead-of-time), it's much more efficient to have the model compiled ahead of execution, such as during the app's initialization, page load, or the model's construction time, while execution may actually happen a lot more frequently afterward with the pre-compiled model, e.g. on every visual update.

JIT also tends to have a management overhead cost from figuring out when and how to evict the cache and re-JIT, with the worst case being that you always have to JIT the model before execution, which could be really inefficient.

It does, however, simplify the API and make it easier to use. You could eliminate the compile step altogether, but it comes at the cost of having less control. In the context of the GPU, it's almost always preferable to pre-compile, because dynamic processes like fusions, layout assignments, and weight packing could break GPU pipelining. It's still doable, but with more caveats and care to prevent bubbles.

So, I think we still need to keep the compile step. We should be able to fold Compilation and Execution into one though.
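
The ahead-of-time pattern described above might then look like the following sketch, reusing the single-object Compilation shape proposed earlier in the thread (draw() and the buffer names are placeholders):

// Pay the compilation cost once, e.g. at page load.
const compilation = await model.compile();

async function onFrame(frameBuffer) {
  // Reuse the pre-compiled model on every visual update.
  compilation.setInput('input', frameBuffer);
  const [mask] = await compilation.compute(['output']);
  draw(mask);
}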

huningxin commented 4 years ago

So, I think we still need to keep the compile step.

+1

We should be able to fold Compilation and Execution into one though.

+1

Additionally, we may consider removing the Model interface, as the graph info already exists. With that, the graph compilation code would be simplified to:

 const a = nn.input('a', descriptorA);
 const b = nn.constant(descriptorB, bufferB);
 const c = nn.matmul(a, b);
-const model = await nn.createModel([{name: 'c', operand: c}]);
-const compilation = await model.createCompilation({powerPreference: 'low-power'});
+const compilation = await nn.compile({'c': c}, {powerPreference: 'low-power'});

We may also consider making the folded Compilation interface stateless by supplying input buffers to the compute method.

With that, for pre-allocated output buffers, the graph execution code would be simplified to:

-const execution = await compilation.createExecution();
-execution.setInput('a', bufferA);
-execution.setOutput('c', bufferC);
-await execution.startCompute();
+await compilation.compute({'a': bufferA}, {'c': bufferC});

The above code should be able to cover the existing use cases of today's spec. The following samples are for new requirements: 1) no pre-allocated output buffers; 2) dynamic shaped input/output.

If the pre-allocated output buffers are not supplied, compute would return the output buffers with shape info. For example:

let results = await compilation.compute({'a': bufferA});
console.log(results.c.shape);
console.log(results.c.buffer);

For dynamically shaped inputs, compute would allow specifying the input shape. For example:

let results = await compilation.compute({'a': {shape: shapeA, buffer: bufferA}});
console.log(results.c.shape);
console.log(results.c.buffer);

Because a pre-allocated buffer would avoid an extra memory copy to WebAssembly memory, it would be useful to support dynamic output shapes for this usage as well. One option is that compute still returns the inferred output shape when a pre-allocated buffer is provided.

let results = await compilation.compute({'a': {shape: shapeA, buffer: bufferA}}, {'c': bufferC});
console.log(results.c.shape);

If the size of the pre-allocated buffer is not big enough, compute would reject the promise with an error. The error may contain the inferred output shape to help users re-allocate the buffer.

try {
  results = await compilation.compute({'a': {shape: shapeA, buffer: bufferA}}, {'c': bufferC});
} catch (err) {
  bufferC = new Float32Array(sizeOfShape(err.c.shape));
  results = await compilation.compute({'a': {shape: shapeA, buffer: bufferA}}, {'c': bufferC});
}

wchao1115 commented 4 years ago

Just to make sure I understand. You're talking about something like this?

dictionary Input {
  required DOMString name;
  required ArrayBufferView buffer;
  sequence<long> dimensions; // optional
};

dictionary Output {
  required DOMString name;
  ArrayBufferView buffer; // optional
  sequence<long> dimensions; // optional
};

interface Compilation {
  Promise<sequence<Output>> compute(sequence<Input> inputs, optional sequence<Output> outputs);
};

That could work.

huningxin commented 4 years ago

@wchao1115

You're talking about something like this?

Semantics-wise, yes. There are some syntax tweaks: I propose to use record<K, V>, which allows the object-literal syntax for mapping a name to an input/output.

dictionary Input {
  required ArrayBufferView buffer;
  sequence<long> dimensions; // optional
};

dictionary Output {
  ArrayBufferView buffer; // optional
  sequence<long> dimensions; // optional
};

typedef record<DOMString, Input> Inputs;
typedef record<DOMString, Output> Outputs;

interface Compilation {
  Promise<Outputs> compute(Inputs inputs, optional Outputs outputs);
};

With that, the user could pass inputs as in the following code:

let results = await compilation.compute({'a': {buffer: bufferA, dimensions: shapeA}});

and could access an output by its named property:

console.log(results.c.dimensions);
console.log(results.c.buffer);

WDYT?

pyu10055 commented 4 years ago

@huningxin @wchao1115 Are we combining the compilation and execution into a single method, compilation.compute? For users to supply the Input/Output parameters, do they need to know the shape prior to calling the compute method?

pyu10055 commented 4 years ago

@huningxin @wchao1115 Slowly reading through the thread, I think the proposal solves my concerns with the original API.

One extra question is about the concept of compilation for the end user. It looks like the compilation becomes implicit, since it is combined with execution, and it is not clear to me why we need to introduce this concept. Especially for end users who do not care about controlling the compilation, it would be much clearer to have an execute method directly on the Model interface:

  model.execution(Inputs inputs, optional Outputs outputs);

huningxin commented 4 years ago

@pyu10055

Are we combining the compilation and execution into a single method compilation.compute?

No. They are separated.

The compilation would be done by nn.compile as:

const a = nn.input('a', descriptorA);
const b = nn.constant(descriptorB, bufferB);
const c = nn.matmul(a, b);
const compilation = await nn.compile({'c': c}, {powerPreference: 'low-power'});

The execution is done by compilation.compute as:

let results = await compilation.compute({'a': {buffer: bufferA, dimensions: shapeA}});
console.log(results.c.dimensions);
console.log(results.c.buffer);

For users to supply the Input/Output parameters, do they need to know the shape prior to calling the compute method?

It would not require users to know the output shape prior to calling the compute method. According to compute's signature, Promise<Outputs> compute(Inputs inputs, optional Outputs outputs), when the outputs parameter is not provided, the compute method returns the output buffers and shapes, as the above sample code shows.

For static input shape, for example

const a = nn.input({type: 'tensor-float32', dimensions: [2, 2]});

The users can just provide the buffer when calling the compute method, such as

let results = await compilation.compute({'a': {buffer: bufferA}});

For dynamic input shape, for example

const a = nn.input({type: 'tensor-float32', dimensions: [-1, 2]});

The users need to provide both buffer and shape, such as:

let results = await compilation.compute({'a': {buffer: bufferA, dimensions: [2, 2]}});

pyu10055 commented 4 years ago

@huningxin Actually this has gone back a little bit; take the following graph as an example:

const a = nn.input('a', descriptorA);
const b = nn.constant(descriptorB, bufferB);
const c = nn.matmul(a, b);
const d = nn.matmul(a, c);
const compilation = await nn.compile({'d': d}, {powerPreference: 'low-power'});

The compile is done for the full graph. Now I want to execute the subgraph which has c as the output. Can the previous compilation be used for this new execution:

let results = await compilation.compute({'a': {buffer: bufferA, dimensions: shapeA}}, {'c': {...}});
console.log(results.c.dimensions);
console.log(results.c.buffer);

Or does the user need to create a new compilation?

const compilation2 = await nn.compile({'c': c}, {powerPreference: 'low-power'});

huningxin commented 4 years ago

@pyu10055

Or the user need to create a new compilation?

Yes. The users could create a new compilation for that subgraph.

I am thinking about an alternative way that allows specifying multiple outputs when compiling, such as:

const compilation = await nn.compile({'c': c, 'd': d}, {powerPreference: 'low-power'});

The users may then compute either c or d:

let results = await compilation.compute({'a': {buffer: bufferA, dimensions: shapeA}}, {'c': {}});
console.log(results.c.dimensions);
console.log(results.c.buffer);

results = await compilation.compute({'a': {buffer: bufferA, dimensions: shapeA}}, {'d': {}});
console.log(results.d.dimensions);
console.log(results.d.buffer);

Would this work?

pyu10055 commented 4 years ago

@huningxin This would work, but from a usability point of view, why do users need to care about the compilation step? Can the compilation be an internal implementation detail that is hidden from the end users? The reason is that I don't see the benefit of calling the compile + compute APIs, compared to a single API:

  model.execution(Inputs inputs, optional Outputs outputs, {powerPreference: 'low-power'});

huningxin commented 4 years ago

@pyu10055

why do users need to care about the compilation step?

I think the major reason is that compilation takes time and users may want to control when it happens. @wchao1115 shared more insights at https://github.com/webmachinelearning/webnn/issues/87#issuecomment-693699105

One example: with the webnn-polyfill, the compilation time and first-inference time of the LeNet example are:

compilation elapsed time: 582.70 ms
execution elapsed time: 27.80 ms

As we discussed, since there is no dedicated compilation step in tf.js, the webnn-polyfill emulates the compilation by letting tf.js infer once. The first inference compiles the WebGL kernels, so it takes much longer compared to the following inferences. I guess the compilation in a native API would be similar.

wchao1115 commented 4 years ago

This would work, but from a usability point of view, why do users need to care about the compilation step?

It comes down to the choice between ease of use vs. control. A separate compile step gives the users of the API more control to decide when an ideal time to spend on model compilation would be. For the scenarios described earlier in the thread, it is important to keep it.

@huningxin I'm not familiar with record<K,V>. Is it supported in the W3C WebIDL spec? Can you point me to the standard definition? I'm not sure what the c in the expression results.c.dimensions stands for.