Closed wchao1115 closed 4 years ago
Thanks @wchao1115 for bringing up this discussion.
MobileNet is one of the first-wave models. The existing conv2d doesn't support its depthwise convolution. Adding a groupCount parameter could fix this issue.
From the input dimension perspective, we may consider supporting 1D and 3D convolution as well.
Thanks @huningxin. I think 1D and 3D are different enough to warrant their own operators, i.e. conv1d and conv3d. We can probably fit the 3D case within conv2d by accepting the depth dimension in the filter tensor, but that would make the condition harder to detect and harder to handle efficiently at execution time. Keeping the dimensionality difference on different operators could also, in theory, make runtime's graph fusion a bit easier.
Keeping the dimensionality difference on different operators could also, in theory, make runtime's graph fusion a bit easier.
This is a good point. 👍
On the other hand, supporting N spatial dimensions in one conv operator would reduce the number of core operators to be supported. I noted both Conv of ONNX and Convolution of XLA-HLO support N spatial dimensions. Those would be good design references.
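To illustrate why a single rank-generic operator is feasible, here is a minimal sketch of the usual output-size rule applied per spatial dimension. The function name `conv_output_shape` and its parameters are hypothetical and are not taken from ONNX, XLA-HLO, or WebNN; the sketch only assumes the standard stride/padding/dilation arithmetic.

```python
def conv_output_shape(input_spatial, kernel_spatial,
                      strides=None, pads=None, dilations=None):
    """Output spatial shape of a convolution over any number of spatial dims.

    Hypothetical helper for illustration only. `pads` holds one
    (begin, end) pair per spatial dimension.
    """
    n = len(input_spatial)
    if len(kernel_spatial) != n:
        raise ValueError("input and kernel must have the same spatial rank")
    strides = strides or [1] * n
    pads = pads or [(0, 0)] * n
    dilations = dilations or [1] * n

    out = []
    for size, k, s, (pad_begin, pad_end), d in zip(
            input_spatial, kernel_spatial, strides, pads, dilations):
        effective_k = d * (k - 1) + 1  # kernel extent once dilation is applied
        out.append((size + pad_begin + pad_end - effective_k) // s + 1)
    return out


# The same rule covers the 1D, 2D and 3D cases.
shape_1d = conv_output_shape([10], [3])
shape_2d = conv_output_shape([224, 224], [3, 3],
                             strides=[2, 2], pads=[(1, 1), (1, 1)])
shape_3d = conv_output_shape([16, 32, 32], [3, 3, 3])
```

The shape rule itself does not depend on rank; the open question in this thread is whether implementations can execute all ranks efficiently through one call path.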
I can see how depth can be integrated into the existing 2D API. In fact, that's how DirectML does it too. However, I have yet to find an IHV willing to implement depth in the same call path as the rest of conv2d, which makes me doubt the potential benefit of overloading depth in the 2D API. 1D, however, is a real special case. I can't see it integrated into the current API without making it much harder to explain.
According to resolution 01 of the WebML CG Teleconference – 28 May 2020:
RESOLUTION: Add grouped convolution support to existing conv2d definition in WebNN API
The way we define conv2d today is sufficient for a typical usage of convolution. However, there are a couple of variants of the convolution operation we should consider supporting.
A groupCount param is needed, where the filter tensor shape becomes [out_channels / group_count, group_count, in_channels / group_count, H, W] when group_count > 1.
Note that depthwise convolution, the one used by MobileNet, is implemented today as a two-pass convolution, with the first depthwise pass done with the groupCount set to the number of input channels, while the second pointwise pass is simply a convolution with a 1x1 filter kernel size.
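The two-pass scheme can be sketched in plain Python. This is a naive illustration, not the WebNN API: the names `grouped_conv2d` and `depthwise_separable_conv2d` are hypothetical, stride is fixed at 1 with no padding, and the filter uses an assumed 4-D layout [out_channels, in_channels / groups, kH, kW] rather than the 5-D shape quoted above.

```python
def grouped_conv2d(x, w, groups=1):
    """Naive grouped 2D convolution (stride 1, no padding).

    x: nested lists, shape [in_channels][H][W]
    w: nested lists, shape [out_channels][in_channels // groups][kH][kW]
       (assumed layout for this sketch)
    """
    c_in, c_out = len(x), len(w)
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    in_per_group = c_in // groups
    out_per_group = c_out // groups
    k_h, k_w = len(w[0][0]), len(w[0][0][0])
    out_h = len(x[0]) - k_h + 1
    out_w = len(x[0][0]) - k_w + 1

    y = [[[0.0] * out_w for _ in range(out_h)] for _ in range(c_out)]
    for oc in range(c_out):
        # Each output channel only reads the input channels of its own group.
        first_in = (oc // out_per_group) * in_per_group
        for ic in range(in_per_group):
            plane = x[first_in + ic]
            kernel = w[oc][ic]
            for i in range(out_h):
                for j in range(out_w):
                    acc = 0.0
                    for di in range(k_h):
                        for dj in range(k_w):
                            acc += plane[i + di][j + dj] * kernel[di][dj]
                    y[oc][i][j] += acc
    return y


def depthwise_separable_conv2d(x, depthwise_w, pointwise_w):
    """Depthwise pass (groups == in_channels), then a 1x1 pointwise pass."""
    depthwise_out = grouped_conv2d(x, depthwise_w, groups=len(x))
    return grouped_conv2d(depthwise_out, pointwise_w, groups=1)


# Two 3x3 input channels.
x = [
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]],
    [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]],
]
# Depthwise filter: one 2x2 kernel per input channel, shape [2][1][2][2].
depthwise_w = [
    [[[1.0, 0.0], [0.0, 1.0]]],
    [[[1.0, 1.0], [1.0, 1.0]]],
]
# Pointwise filter: mixes the 2 channels into 1, shape [1][2][1][1].
pointwise_w = [[[[1.0]], [[0.5]]]]

result = depthwise_separable_conv2d(x, depthwise_w, pointwise_w)
print(result)  # [[[8.0, 10.0], [14.0, 16.0]]]
```

With groups equal to the number of input channels, each kernel touches exactly one channel, and the 1x1 pointwise pass is an ordinary convolution (groups = 1) that mixes the channels afterwards.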