webmachinelearning / webnn

🧠 Web Neural Network API
https://www.w3.org/TR/webnn/

More variations of supported convolution type needed. #60

Closed wchao1115 closed 4 years ago

wchao1115 commented 4 years ago

The way we define conv2d today is sufficient for typical uses of convolution. However, there are a couple of variants of the convolution operation we should consider supporting:

  1. Grouped convolution, used in AlexNet. A new groupCount parameter is needed; when group_count > 1, the filter tensor shape becomes [out_channels / group_count, group_count, in_channels / group_count, H, W].
  2. Transposed convolution, used in autoencoders or in models that generate high-resolution images, e.g. skeleton tracking. Also known as "backward" convolution, it is used to compute the convolution gradient during model training. The API needs an extra enum parameter to support it. (A rough call-site sketch follows this list.)
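
To make the proposal concrete, here is a rough call-site sketch. The builder object `nn`, the option names `groupCount` and `type`, and the shapes are illustrative assumptions, not part of the current spec:

```js
// Illustrative only: `nn`, `input`, `filter`, the option names, and the
// shapes are hypothetical stand-ins for whatever the spec eventually defines.

// Grouped convolution: the filter is split into groupCount groups along the
// channel dimension when groupCount > 1.
const grouped = nn.conv2d(input, filter, {
  padding: [1, 1, 1, 1],
  strides: [1, 1],
  groupCount: 32          // proposed new parameter
});

// Transposed ("backward") convolution, selected via a new enum-style option.
const upsampled = nn.conv2d(input, filter, {
  strides: [2, 2],
  type: 'transposed'      // proposed enum value for the convolution variant
});
```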

Note that the depthwise separable convolution used by MobileNet is implemented today as a two-pass convolution: the first, depthwise pass is done with groupCount set to the number of input channels, while the second, pointwise pass is simply a convolution with a 1x1 filter kernel size.
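
For concreteness, that two-pass pattern could then be written as two conv2d calls. Again, `nn`, `c` (the input channel count), and the filter variables are illustrative assumptions:

```js
// Depthwise pass: one group per input channel, so each channel is convolved
// with its own 3x3 filter. Illustrative filter shape: [c, 1, 3, 3].
const depthwise = nn.conv2d(input, depthwiseFilter, {
  padding: [1, 1, 1, 1],
  groupCount: c            // groupCount equals the number of input channels
});

// Pointwise pass: an ordinary convolution with a 1x1 kernel that mixes the
// channels. Illustrative filter shape: [outChannels, c, 1, 1].
const pointwise = nn.conv2d(depthwise, pointwiseFilter, {
  padding: [0, 0, 0, 0]
});
```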

huningxin commented 4 years ago

Thanks @wchao1115 for bringing up this discussion.

MobileNet is one of the first-wave models. The existing conv2d doesn't support its depthwise convolution; adding a groupCount parameter could fix this.

From the input-dimensionality perspective, we may also consider supporting 1D and 3D convolution:

  1. 1D convolution: useful for audio data processing in use cases like speech recognition and noise suppression, and also for sensor data processing for gesture recognition.
  2. 3D convolution: useful for video data processing in use cases like action recognition, and also for 3D image processing (e.g. depth camera data) for object recognition.
wchao1115 commented 4 years ago

Thanks @huningxin. I think 1D and 3D are different enough to warrant their own operators, i.e. conv1d and conv3d. We could probably fit the 3D case within conv2d by accepting a depth dimension in the filter tensor, but that would make the condition more complicated to detect and to execute efficiently. Keeping the dimensionality difference in separate operators could also, in theory, make the runtime's graph fusion a bit easier.

huningxin commented 4 years ago

> Keeping the dimensionality difference in separate operators could also, in theory, make the runtime's graph fusion a bit easier.

This is a good point. 👍

On the other hand, supporting N spatial dimensions in one conv operator would reduce the number of core operators to be supported. I noted that both Conv in ONNX and Convolution in XLA-HLO support N spatial dimensions; those would be good design references.

wchao1115 commented 4 years ago

I can see how depth could be integrated into the existing 2D API; in fact, that's how DirectML does it. However, I have yet to find an IHV willing to implement depth in the same call path as the rest of conv2d, which makes me doubt the benefit of overloading the 2D API with depth. 1D, however, is a real special case: I can't see it integrated into the current API without making it much harder to explain.
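
For illustration, the usual caller-side workaround today is to view the 1D signal as a height-1 image and run it through conv2d. A minimal sketch, assuming a reshape op on the builder and illustrative shape variables:

```js
// 1D input [n, c, length] viewed as a height-1 2D input [n, c, 1, length].
const input2d  = nn.reshape(signal, [n, c, 1, length]);
// 1D filter [outC, c, k] viewed as a [outC, c, 1, k] 2D filter.
const filter2d = nn.reshape(filter1d, [outC, c, 1, k]);

// Convolve along the width dimension only.
const out2d = nn.conv2d(input2d, filter2d, { strides: [1, 1] });

// Drop the dummy height dimension to recover the 1D result
// (outLength depends on length, k, strides, and padding).
const out1d = nn.reshape(out2d, [n, outC, outLength]);
```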

huningxin commented 4 years ago

Per resolution 01 of the WebML CG Teleconference – 28 May 2020:

> RESOLUTION: Add grouped convolution support to existing conv2d definition in WebNN API

#65 has been created to support grouped conv2d. Please take a look. Thanks!