webmachinelearning / proposals

🚀 Proposals for future work

Operation-specific APIs #2

Open jbingham opened 3 years ago

jbingham commented 3 years ago

Operation-specific APIs

This is a proposal to define and implement a small number of standalone APIs for individual compute-intensive operations (like convolution 2D and matrix multiplication) that are often the target of hardware acceleration. The APIs would be atomic and would not be tied to a graph or model loader implementation. It would be up to JavaScript libraries or WebAssembly to call into these low-level APIs.

Short description

Across many common machine learning models, there are a handful of compute-intensive operations that may account for 90-99% of inference time, based on the benchmarking done for WebNN. If these few operations were offered as standalone APIs, hardware acceleration could provide much of the performance benefit with a small, simple API surface, without needing to define all of the many other instructions and graph topology needed for a higher-level API like a graph or model loader. As an added benefit, it ought to be faster to get this handful of APIs shipped.

JavaScript ML libraries would need to be updated to take advantage of the APIs, just like they can take advantage of WebGL today.

Example use cases

Image classification typically uses convolution and matrix multiplication. With hardware accelerated versions of these two operations, the performance boost would be close to the optimal that could be achieved with a complete graph or model execution API.

A rough idea or two about implementation

Maybe the closest example is WebGL compute shaders, except that these operations would be much simpler.

anssiko commented 3 years ago

This proposal is on the 2021-01-07 call agenda.

wchao1115 commented 3 years ago

Is it possible to share some example function prototypes of the proposed atomic operations? Given that the proposal calls for independent functions that are not tied to the graph, it would be informative to understand the nature of the data types that flow through such functions and how they would be consumed by the callers.

jbingham commented 3 years ago

Here's a little more detail, after talking with Ping.

A simple JavaScript API could accept WebGL TextureData as input/output tensors.

// create op
const conv2d_op = webml.createOp(inputTextureData, outputTextureData, ...)
// cache the op
...
// execute the op
webml.runKernel(conv2d_op, params...)

Texture data definition for WebGL

export interface TextureData {
 // Required.
 shape: number[];
 dtype: DataType;
 // Optional.
 values?: backend_util.BackendValues;
 texture?: WebGLTexture;
 /** [rows, columns] shape of the texture, tensors are flatten when stored. */
 texShape?: [number, number];
}

We can discuss in the next conference call.

pyu10055 commented 3 years ago

Background

As described in this issue for the WebML community group, the proposal aims to define a small set of compute-intensive operations (like convolution 2D and matrix multiplication) that are often the target of hardware acceleration. These are atomic APIs and would not be tied to a graph or model loader implementation.

Since these APIs do not target full graph execution, they are useful for JavaScript ML frameworks like TensorFlow.js, allowing them to access OS- or hardware-level acceleration that cannot be achieved through the existing Web APIs (WebAssembly, WebGL or WebGPU).

The typical way for frameworks to utilize the op-level API is through some kind of delegation mechanism. The ML framework is responsible for loading and interpreting the ML model. During the model execution phase, the model executor iterates through the ops and polls all backends for each op. The backends that support the op are sorted by priority, and the one with the highest priority is selected to execute the op. This mechanism maximizes the number of supported ops overall, but it faces data IO efficiency issues when switching from one backend to another: the latency of transferring data between backends could potentially outweigh the acceleration benefit provided by the op API.
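
For illustration, here is a minimal sketch of this delegation mechanism in JavaScript (all names are assumed and not taken from any particular framework):

// Minimal sketch of the delegation mechanism (all names assumed):
// poll the registered backends for each op and run it on the
// highest-priority backend that supports it.
function executeOp(op, backends) {
  const candidates = backends
    .filter((backend) => backend.supports(op.type))
    .sort((a, b) => b.priority - a.priority);
  if (candidates.length === 0) {
    throw new Error(`No backend supports op: ${op.type}`);
  }
  // The highest-priority backend wins. Switching backends between consecutive
  // ops may require transferring tensor data, which is the IO efficiency
  // issue described above.
  return candidates[0].execute(op);
}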

In reality, users construct data pipelines for ML-related tasks, and model execution is only one of the steps. For example, in an AR task (lipstick try-on), the data pipeline contains many pre/post processing steps, which typically take place on the GPU. Ensuring that model execution also happens on the GPU avoids unnecessary copying between GPU and CPU.

Solution

A graph-level API is one way to avoid the IO bottleneck, by executing as much of the model graph as possible on a single backend. The other way is to lock the op API to a single accelerator (CPU or GPU), in order to reduce IO transfers between accelerators.

Most JavaScript ML frameworks try to access hardware acceleration through existing Web APIs: WebAssembly for CPU acceleration (SIMD, multi-threading), and WebGL or WebGPU for GPU acceleration. For example, TensorFlow.js has three backends (WebGL, WebAssembly and WebGPU), and models are typically executed within one of them.

If the WebML op-level API targets existing Web APIs for its data binding mechanism, it can be easily incorporated into the different backends that a JavaScript ML platform provides.

CPU - WebAssembly

For example, in WebAssembly, a Tensor can be represented as follows:

// Holds the memory offset and the size of a tensor.
struct TensorInfo {
 // Pointer to the bytes where the data is allocated.
 void *memory_offset;
 // Total number of elements.
 const size_t size;

 const float *f32() const {
   return reinterpret_cast<const float *>(memory_offset);
 }

 float *f32_write() { return reinterpret_cast<float *>(memory_offset); }

 const int32_t *i32() const {
   return reinterpret_cast<const int *>(memory_offset);
 }

 int32_t *i32_write() { return reinterpret_cast<int32_t *>(memory_offset); }

 const bool *b() const {
   return reinterpret_cast<const bool *>(memory_offset);
 }

 bool *b_write() { return reinterpret_cast<bool *>(memory_offset); }
 ...
};

The WebML op API would need to provide a C++ header file that exposes kernel compilation and execution APIs.

// create op
conv2d_op = webml_create_conv2d(input_tensor_info, output_tensor_info...)
// cache the op
...
// execute the op
webml_run_kernel(conv2d_op, params...)

It would likely need to copy the data in and out of the WebAssembly memory heap to the place where the acceleration happens (e.g. Intel SIMD instructions in OpenVINO). Since we are only targeting CPU acceleration with WebAssembly, the data transfer is bound within the CPU. Even with this overhead, the performance should still be much faster than a pure WebAssembly implementation.

This would be similar to how TensorFlow.js currently utilizes the XNNPack library from TFLite in WebAssembly. XNNPack provides CPU acceleration utilizing WebAssembly SIMD for around 20 kernels.
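
As a purely hypothetical illustration of the copy described above (all names assumed), the glue code on the JavaScript side might copy the tensor bytes out of the WebAssembly heap, run the accelerated kernel, and copy the result back:

// Hypothetical sketch only (all names assumed): copy the input tensor out of
// the WebAssembly heap, run the accelerated kernel on it, and copy the result
// back into the heap so the framework can keep using its TensorInfo views.
const inputView = new Float32Array(wasmMemory.buffer, inputOffset, inputSize);
const kernelInput = Float32Array.from(inputView);     // copy out of the Wasm heap
const kernelOutput = acceleratedConv2d(kernelInput);  // assumed accelerated kernel call
new Float32Array(wasmMemory.buffer, outputOffset, kernelOutput.length).set(kernelOutput);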

GPU

When the data is already on the GPU, the tensor data could be stored as a texture for WebGL or a memory buffer for WebGPU. Supporting these input types could be valuable.

Texture data definition for WebGL

export interface TextureData {
 // Required.
 shape: number[];
 dtype: DataType;
 // Optional.
 values?: backend_util.BackendValues;
 texture?: WebGLTexture;
 /** [rows, columns] shape of the texture, tensors are flatten when stored. */
 texShape?: [number, number];
}

WebML op API would provide a JS API that accepts WebGL TextureData similar to above as input/output tensors.

// create op
const conv2d_op = webml.createOp(inputTextureData, outputTextureData, ...)
// cache the op
...
// execute the op
webml.runKernel(conv2d_op, params...)

Drawbacks

huningxin commented 3 years ago

@pyu10055 , thanks much for sharing the details. It is very helpful to understand the op delegation mechanism of JavaScript ML frameworks, like TensorFlow.js.

I suppose WebNN is able to support this mechanism with single-op graphs. And thanks to the support of pre-allocated buffers, the memory represented by TensorInfo could be used as input and output of the WebNN Compilation.compute method. To help the investigation, I put together a simple polyfill of webml_create_conv2d and webml_run_kernel based on the WebNN API for the CPU - WebAssembly case.

async function webml_create_conv2d(input_tensor_info, filter_tensor_info, output_tensor_info, params) {
  const nn = navigator.ml.getNeuralNetworkContext();
  const builder = nn.createModelBuilder();
  const input = builder.input('input', {type: 'float32', dimensions: input_tensor_info.shape});
  const filter = builder.constant({type: 'float32', dimensions: filter_tensor_info.shape}, filter_tensor_info.f32);
  const output = builder.conv2d(input, filter, params);
  const op = builder.createModel({output});
  return {
    type: 'conv2d',
    compiledOp: await op.compile(),
    inputs: {'input': {buffer: input_tensor_info.f32}},
    outputs: {'output': {buffer: output_tensor_info.f32_write}}
  };
}

async function webml_run_kernel(op) {
  // op.type === 'conv2d'
  await op.compiledOp.compute(op.inputs, op.outputs);
}

// Emulate the heap of WebAssembly code
const heap = new WebAssembly.Memory({initial: 1}).buffer;

// Emulate the `TensorInfo` struct by JS object:
//  - add `shape` field that describes the tensor shape
//  - only support `f32` and `f32_write` for sake of simplicity
const input_tensor_info = {'shape': [1, 1, 5, 5], 'f32': new Float32Array(heap, 0, 25), 'f32_write': new Float32Array(heap, 0, 25)};
const filter_tensor_info = {'shape': [1, 1, 3, 3], 'f32': new Float32Array(heap, 25 * 4, 9), 'f32_write': new Float32Array(heap, 25 * 4, 9)};
const output_tensor_info = {'shape': [1, 1, 3, 3], 'f32': new Float32Array(heap, 34 * 4, 9), 'f32_write': new Float32Array(heap, 34 * 4, 9)};

// create op
filter_tensor_info.f32_write.fill(1);
const conv2d_op = await webml_create_conv2d(input_tensor_info, filter_tensor_info, output_tensor_info);

// execute the op
input_tensor_info.f32_write.fill(1);
await webml_run_kernel(conv2d_op);
console.log(`output values: ${output_tensor_info.f32}`);
// output values: 9,9,9,9,9,9,9,9,9

// execute the op with different input
input_tensor_info.f32_write.fill(2);
await webml_run_kernel(conv2d_op);
console.log(`output values: ${output_tensor_info.f32}`);
// output values: 18,18,18,18,18,18,18,18,18

You can copy and paste the sample code above into the WebNN code editor and try it. Please click the Edit button before pasting.

Probably, we could incorporate the op delegation mechanism into WebNN framework use cases and explainer, and support it better with some enhancements, for example:

  1. Indicate that the implementation might need to optimize for the single-op graph, e.g. select the single-op execution mode if the native ML API supports and optimizes it.
  2. Support the device preference, say cpu or gpu, when compiling a graph, to reduce IO transferring between accelerators.
  3. Support the GPU buffers as compute inputs and outputs, for GPU pipeline integration.

Actually, 2 and 3 are not specific to single-op graph execution; they would also benefit multi-op graph execution. A rough sketch of what 2 and 3 might look like is given below.
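
For example, on top of the polyfill above, enhancements 2 and 3 might look roughly like this (a hypothetical sketch only; the option name and the GPU binding shape are assumptions, not actual WebNN API):

// Hypothetical sketch of enhancements 2 and 3 (option and field names are assumed).
// 2. Device preference at compile time, to keep execution on a single accelerator.
const compiledOp = await op.compile({devicePreference: 'gpu'});
// 3. GPU buffers (e.g. a WebGLTexture or a WebGPU GPUBuffer) bound directly as
//    compute inputs/outputs, so a GPU pipeline avoids CPU round-trips.
await compiledOp.compute(
    {'input': {buffer: inputTexture}},
    {'output': {buffer: outputTexture}});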

Any other thoughts? @wchao1115 @anssiko

wchao1115 commented 3 years ago

Thanks @jbingham and @pyu10055 for bringing forward this discussion. It touches on a few issues, which I'll try to summarize here:

  1. Some existing frameworks may already implement graph, but may still want to offload some performance critical tasks to the underlying platform e.g. convolution, etc. without taking a dependency on the entire webnn graph and operation set.
  2. When an operation is invoked from a framework to the underlying platform, the data type and layout format must be in a form that can be readily usable by the operation without incurring additional copies or conversions.
  3. The execution device implementing the operations and one used by the rest of the framework's graph must be the same device to avoid data transfer overhead.

I think a key question is whether we think addressing these issues would warrant defining a new set of API altogether.

As @huningxin pointed out in his reply, #1 can be addressed simply by allowing the framework direct access to the WebNN convolution operation, and #2 by extending the WebNN API to support native tensor data types, something we would need to consider anyway regardless of this discussion in order to avoid excessive copying in high-bandwidth visual scenarios. Arguably issue #3 is a framework's policy, which may vary among different framework implementations. But by defining WebNN as a graph API, we implicitly influence a single-device design from the get-go, thus side-stepping the issue altogether. We adopted the same mentality when we designed the DirectML graph API to allow for additional graph transforms, to avoid device stalling, and to reduce internal data transfers.

What I want to add here is that we might also need to consider eager execution in WebNN. That way a graph's compile step could also become optional.
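
Purely as a hypothetical sketch (none of these names are actual WebNN API), an eager-style call might let a framework run a single op directly, without an explicit graph build or compile step:

// Hypothetical eager-style call (all names assumed, not actual WebNN API):
// the op is executed directly, with no graph construction or compile step.
const nn = navigator.ml.getNeuralNetworkContext();
const output = await nn.conv2d(
    {type: 'float32', dimensions: [1, 1, 5, 5], buffer: inputData},
    {type: 'float32', dimensions: [1, 1, 3, 3], buffer: filterData},
    {strides: [1, 1]});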

huningxin commented 3 years ago

Thanks @wchao1115 for a great summary.

  1. Some existing frameworks may already implement graph, but may still want to offload some performance critical tasks to the underlying platform e.g. convolution, etc. without taking a dependency on the entire webnn graph and operation set.

I'd like to add that the performance-critical tasks may involve some kind of operation fusion. For example, convolution + bias add + activation are normally fused for performance optimization in native ML APIs, e.g. Convolution in oneDNN. The fused ops are also used by JS ML frameworks, e.g. the FusedConv2d op of TensorFlow.js. For such fused ops, the JS ML frameworks may still need to create a small WebNN graph that wires conv2d, element-wise add and one of the activations, so the implementation could compile that graph into a fused native convolution op. A rough sketch is shown below.
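
A minimal sketch, using the same builder-style API as the polyfill earlier in this thread (the shapes and the filterData/biasData arrays are illustrative assumptions):

// Minimal sketch of a small WebNN graph wiring conv2d + bias add + relu, which
// an implementation could compile into a single fused native convolution op.
// Shapes and the filterData/biasData arrays are illustrative assumptions.
const nn = navigator.ml.getNeuralNetworkContext();
const builder = nn.createModelBuilder();
const input = builder.input('input', {type: 'float32', dimensions: [1, 1, 5, 5]});
const filter = builder.constant({type: 'float32', dimensions: [1, 1, 3, 3]}, filterData);
const bias = builder.constant({type: 'float32', dimensions: [1]}, biasData);
const output = builder.relu(builder.add(builder.conv2d(input, filter), bias));
const fusedConv2d = builder.createModel({output});
const compiledFusedConv2d = await fusedConv2d.compile();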

anssiko commented 3 years ago

In the Operation-specific APIs discussion on the 18 March 2021 call, we agreed to address the requirements laid out in this proposal in the WebNN API. @huningxin, @wchao1115 and @RafaelCintron will open issues in the webnn repo to track the remaining work identified. First changes landed in https://github.com/webmachinelearning/webnn/pull/149 already.

We'll keep this issue open until the requirements have been satisfied. Thanks @pyu10055 and @jbingham for explaining this important use case.

@huningxin feel free to open an issue to update the WebNN API framework use cases accordingly.

huningxin commented 3 years ago

@huningxin feel free to open an issue to update the WebNN API framework use cases accordingly.

Done. https://github.com/webmachinelearning/webnn/pull/154. @pyu10055 , @jbingham @wchao1115 please take a look. Thanks!

anssiko commented 3 years ago

I opened https://github.com/webmachinelearning/webnn/issues/157 in response to @rafaelcintron's comment on our call.

@wchao1115 @pyu10055 please check all requirements derived from this issue have a corresponding webnn issue: https://github.com/webmachinelearning/webnn/issues