webmachinelearning / webnn

🧠 Web Neural Network API
https://www.w3.org/TR/webnn/

Simplify the operand layout support of conv2d and pooling 2d operations #324


huningxin commented 1 year ago

In the existing WebNN spec, conv2d supports two input operand layouts defined by MLInputOperandLayout and four filter operand layouts defined by MLConv2dFilterOperandLayout.

```webidl
enum MLInputOperandLayout {
  "nchw",
  "nhwc"
};

enum MLConv2dFilterOperandLayout {
  "oihw",
  "hwio",
  "ohwi",
  "ihwo"
};
```

This may make the implementation more complicated, especially if a native ML framework or OS API doesn't support some of these layouts. If a layout is unsupported, the implementation may need to insert transpose operations into the graph around the conv2d operation to convert the unsupported layout into a supported one. This can easily lead to an inefficient graph representation with redundant transpose operations, or force the implementation to optimize the graph with techniques such as "transpose sinking", which requires a more complex implementation. This issue was raised in a Chromium CL review.

To simplify the implementation, the proposal is to reduce the supported operand layouts, for example by keeping only the default one. Because WebNN supports the transpose operation, layout adaptation and graph-level optimization can be handled by the ML frameworks, which usually already support such functionality (see the sketch below).
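For illustration, a minimal sketch of how a framework could do this layout adaptation itself with the existing transpose() operation. This assumes a simplified WebNN that only accepts NCHW input and OIHW filters, and a model holding NHWC activations with OHWI filters; the helper and variable names are hypothetical:

```js
// Hypothetical framework-side helper: the model stores NHWC tensors and OHWI
// filters, but this (assumed) WebNN build only accepts NCHW input / OIHW filters.
function conv2dFromNhwcModel(builder, nhwcInput, ohwiFilter, options = {}) {
  // NHWC -> NCHW for the input operand.
  const nchwInput = builder.transpose(nhwcInput, { permutation: [0, 3, 1, 2] });
  // OHWI -> OIHW for the filter operand.
  const oihwFilter = builder.transpose(ohwiFilter, { permutation: [0, 3, 1, 2] });
  const output = builder.conv2d(nchwInput, oihwFilter, options);
  // NCHW -> NHWC so the rest of the NHWC graph is unchanged.
  return builder.transpose(output, { permutation: [0, 2, 3, 1] });
}
```

Back-to-back transposes produced by adjacent layers can then be folded away by the framework's own graph optimizer rather than by the WebNN implementation.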

Thanks @wacky6 for this idea.

anssiko commented 1 year ago

This issue was discussed at the WebML WG Teleconference – 16 March 2023. Summary: Awaits further implementation feedback.

fdwr commented 1 year ago

Picking just one preferred layout in WebNN could make life easier for the calling framework and the underlying backend implementation, or it could make it harder for both.

I prefer accepting both (keeping the current spec), but it would be informative to see a holistic table of each major framework's preferred layout and each backend's preferred layout.

[updated...] Table added (✅ == default):

| API | NCHW | NHWC | Notes |
|---|---|---|---|
| CoreML | NCHW ✅ | - | - |
| ONNX | NCHW ✅ | - | - |
| PyTorch | NCHW ✅ | NHWC | NCHW is default, but NHWC is supported more recently via torch.memory_format.channels_last |
| TensorFlow & TensorFlow.js & TFLite | NCHW | NHWC ✅ | data_format defaults to NHWC or channelsLast |
| TFLite | - | NHWC ✅ | - |
| Intel OneDNN dnnl_format_tag_t | NCHW ✅ | NHWC | "NCHW is the recommended data layout" |
| CuDNN cudnnTensorFormat_t | NCHW ✅ | NHWC | NCHW is the default order for cudnnSetTensor4dDescriptor |
| DirectML | NCHW ✅ | NHWC | NCHW is default, but NHWC is supported via explicit strides |
| XNNPack | - | NHWC ✅ | - |
| NVIDIA tensor cores | NCHW | NHWC ✅ | "Tensor Cores are fastest when input tensors are laid out in NHWC ... NCHW layouts can still be operated on by Tensor Cores, but include some overhead due to automatic transpose operations" |

anssiko commented 1 year ago

@fdwr thanks for sharing your preference and the supporting details.

As an aside, I encourage incorporating considerations such as this into the specification informatively, alongside the normative prose. It helps explain the specification to people who read it without the full context that active WG participants have.

wacky6 commented 1 year ago

Layout support comes up in the MLOperand implementation that allows data shape broadcasting. https://chromium-review.googlesource.com/c/chromium/src/+/4396686/comment/f02acaeb_3c2795f2/

Supporting both channel-first and channel-last layouts will complicate the spec steps and the implementation, because the current NumPy-style broadcast rule aligns dimensions starting from the right-most axis.

Example: the caller wants to apply a per-channel multiplication (see the sketch after this list).

  1. lhs is nchw {1,3,4,4}; the caller provides rhs {3}. This will fail under the current broadcast rule, so the caller would need to reshape/broadcast rhs itself.
  2. lhs is nhwc {1,4,4,3}; the caller provides rhs {3}. Works as intended.
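A minimal sketch of the two cases, assuming `builder` is an `MLGraphBuilder` and `activationNchw`, `activationNhwc`, and `scale` are hypothetical `MLOperand`s built elsewhere:

```js
// Case 1: NCHW activation {1, 3, 4, 4} with a per-channel scale of shape {3}.
// Right-aligned broadcasting would pair the 3 with the last axis (W = 4) and
// fail, so the caller must first reshape the scale to {1, 3, 1, 1}.
const scaleNchw = builder.reshape(scale, [1, 3, 1, 1]);
const scaledNchw = builder.mul(activationNchw, scaleNchw);

// Case 2: NHWC activation {1, 4, 4, 3} with the same {3} scale.
// The channel axis is already right-most, so broadcasting works directly.
const scaledNhwc = builder.mul(activationNhwc, scale);
```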

How to support case 1 isn't clear. Some questions might help the decision.

I have a slight preference for supporting only one layout (NHWC to be precise).

wacky6 commented 1 year ago

I want to share a data point.

I was playing with Real-ESRGAN today and found that, with torch.compile, the channels_last layout is faster than the channels_first layout on my NVIDIA A4000.

I'm not sure how well this transfers to other models (ESRGAN is heavily based on CNNs + residual connections), though.

I wonder if we should benchmark channel ordering on different hardware (i.e. vendors other than NVIDIA could optimize for channels_first).

Or maybe this won't matter if the graph builder (or rather the optimizer) is "clever" enough.

huningxin commented 1 year ago

There is a security perspective from @quidity (thanks Alex!) in the review of Chromium CL-4653303, "WebNN: Define conv2d operator in mojo".

Alex mentioned:

> `enum Conv2dFilterOperandLayout { kOihw, kHwio, kOhwi, kIhwo };` this feels very error prone - is there a better way to represent the layout of the data at this stage or restrict the ways that is presented to the privileged process?

wacky6 commented 1 year ago

FWIW, another way to tackle layout is to tell the implementation which layout should be used, like: https://pytorch.org/tutorials/intermediate/memory_format_tutorial.html

This could be a hint to GraphBuilder.build() (right before producing a graph that can be passed to compute()).
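A purely hypothetical shape for such a hint, only to illustrate the idea; the second argument to build() does not exist in the current spec:

```js
// Hypothetical: 'preferredLayout' is NOT part of the WebNN spec; it only
// illustrates passing a layout hint at build time, as suggested above.
const graph = await builder.build({ output: outputOperand },
                                  { preferredLayout: 'nhwc' });
```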

--

Taking a step back, I still strongly prefer a single unified layout (i.e. NHWC) that's applied throughout MLGraphBuilder methods, and let the backend (e.g. DMLImpl) change the layout (if necessary) before sending to hardware.

zolkis commented 1 year ago

From the end user's perspective, I sympathize with the single default layout idea: select the layout most widely supported in the industry and let the back-end make any changes needed, making layout an implementation detail. End users might need to convert layouts in a few corner cases.

However, from an API design standpoint, there is also the question of what the clients of this API will want to control, i.e. what the API should expose.

A user-facing API is freer to make simplifications by convention or by re-framing the user interaction. A lower-level API such as this one should be simple and transparent, without added "smartness"; therefore, in the long term it might be better to support the union of the use cases.

In the comment above there are arguments that the single default layout might also simplify the usage of the API.

When unsure, it is usually good practice in Web APIs to start with the simpler API and extend it as needed, making sure extensibility is possible by design, i.e. without API breaks.

wchao1115 commented 1 year ago

The design intent of WebNN as a backend API prioritizes completeness, efficiency, and expressiveness over ease of use. For instance, automatic shape inference is not supported, as it is assumed to be the responsibility of the calling framework or of an app calling into WebNN directly. This limitation, while making the API less easy to use, allows it to be more flexible and adaptable to different framework policies.

I agree with the premise that having excessive layout options makes the API harder to implement. I think reducing the filter layout options in MLConv2dFilterOperandLayout to just the first two, "oihw" and "hwio", makes sense. The first is the default filter layout of torch and ONNX, while the second is TensorFlow's.

However, a trickier conversation is the one about the input layout. Interestingly, the first option "nchw" is also the default in torch and ONNX, while the second option "nhwc" is supported natively in TensorFlow, historically influenced by the design of NVIDIA Tensor Cores' native FP16 layout introduced in 2017 with the Volta generation (the Titan V). It isn't a mainstream layout supported by all other vendors, just a very common one on NVIDIA GPUs with the FP16 tensor data type.

There are TensorFlow models nowadays that still rely on the NHWC input layout. Once converted to the other format, these models often end up with each conv2d layer bracketed by a pair of transposes, a superfluous outcome at first glance, but one easily collapsible by an intelligent backend later on. On the other hand, allowing the NHWC layout to propagate down through the graph could push layout mismatches further down the stack and make it harder for the implementer to detect and optimize away the unneeded double transposes.

I support removing the input layout enum MLInputOperandLayout and assuming the input layout is always NCHW. WebNN already supports the transpose operator whenever a layout change is required by a layout converter.
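Under that proposal, a caller holding an NHWC input and an HWIO filter could still target the NCHW-only input layout with an explicit transpose. A hypothetical sketch, where `nhwcInput` and `hwioFilter` are assumed `MLOperand`s:

```js
// NHWC -> NCHW for the input; the filter can stay in "hwio" since the
// proposal keeps both "oihw" and "hwio" filter layouts.
const nchwInput = builder.transpose(nhwcInput, { permutation: [0, 3, 1, 2] });
const output = builder.conv2d(nchwInput, hwioFilter, { filterLayout: 'hwio' });
// Transpose back only if the consumer of the result expects NHWC.
const nhwcOutput = builder.transpose(output, { permutation: [0, 2, 3, 1] });
```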