Limited support for pad on CoreML backend

pad is implemented on the CoreML backend, but with many constraints. CoreML (see doc) supports three modes: constant, reflect, and replicate (which maps to edge mode).

symmetric mode is not supported. (On the TFLite backend, edge mode is not supported.)
Padding more than the last two dimensions is only supported in 'constant' mode. (This is not documented, but it errors out with this message.) One of the WebNN samples, Fast style transfer, needs to use >2D padding in reflect mode.
If mode is “reflect”, then beginning and ending paddings can be at most input size - 1 (same as TensorFlow).
If mode is “replicate” (aka edge), then beginning and ending paddings can be at most input size.
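
The constraints listed above can be summarized as a small check. This is only a sketch of the limits as described in this issue, not CoreML's actual validation logic; the function name `coreml_pad_limitation` is hypothetical, and the mode names follow WebNN ("reflection" for reflect, "edge" for replicate).

```python
def coreml_pad_limitation(mode, input_shape, beginning_padding, ending_padding):
    """Return None when the call fits the constraints listed above, else a reason string."""
    if mode == "symmetric":
        return "symmetric mode is not supported"
    if mode == "constant":
        return None  # no rank or size limits are listed for constant mode
    rank = len(input_shape)
    for axis in range(rank):
        largest = max(beginning_padding[axis], ending_padding[axis])
        if largest == 0:
            continue
        if axis < rank - 2:
            return f"{mode} mode can only pad the last two dimensions"
        # reflect allows at most size - 1; replicate (edge) allows at most size.
        limit = input_shape[axis] - 1 if mode == "reflection" else input_shape[axis]
        if largest > limit:
            return f"padding {largest} on axis {axis} exceeds the limit of {limit}"
    return None


# A >2D reflection pad, like the one the Fast style transfer sample needs.
print(coreml_pad_limitation("reflection", [1, 3, 4, 4], [0, 1, 1, 1], [0, 1, 1, 1]))
```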

For the third constraint, we can probably update the spec to add such a constraint? #377

For the others, my first question is - @fdwr @huningxin are there reasonable emulations for these unsupported modes and rank limits?

Further questions: @mwyrzykowski, do you think CoreML can improve pad support with the following?

Add support for symmetric mode.
Support padding more than two dimensions for non-constant modes.
Remove the padding size constraint for 'replicate' mode, since it doesn't make much sense: the mode only replicates the edge values.

If none of the restrictions can be emulated or supported by CoreML, we will need to do one of the following:
a. Add these constraints to the WebNN API.
b. Expose the mode limits through opSupportLimits. The rank constraints are tricky to expose, as they vary by mode.
c. Just let them fail on CoreML with async error messages. Not ideal for browser compatibility.

@fdwr @huningxin are there reasonable emulations for these unsupported modes and rank limits?

"Reasonable" is the debatable part, but all of these can be emulated (no fewer than 3 operations though). Hopefully future CoreML ops can support them more efficiently someday. Here's pseudocode for each - let me know if anything could make more sense. They are all very similar conceptually to texture wrapping in graphics APIs (OGL, D3D), except in higher dimensions. With the decomposition below, any dimension count is supported, and there are no repetition limits to input size.

constant

Use expand to repeat a constant value, then concatenate the edges along each dimension.

result = input
for each axis in input tensor rank
    // If padding present for current dimension on either the low or high end.
    if beginningPadding[axis] != 0 || endingPadding[axis] != 0
        lowChunkDimensions = projectToRank(beginningPadding[axis], input.rank, axis)
        highChunkDimensions = projectToRank(endingPadding[axis], input.rank, axis)

        lowChunk  = expand(scalarTensor, lowChunkDimensions)
        highChunk = expand(scalarTensor, highChunkDimensions)
        result    = concat({lowChunk, result, highChunk}, axis)
    endif
endfor

// projectToRank is a little helper that projects a dimension value up to a given rank at the target axis,
// returning a new broadcast-compatible (and concat-compatible) dimension list.
// e.g. dimension size = 3, rank = 4, axis = 2, output = [1,1,3,1]

// Note: enumerating the axes in reverse order (e.g. 3 to 0 for a 4D tensor, rather than 0 to 3) has a slight
// perf benefit, because of the nearer adjacency of elements of higher dimensions.
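
For anyone who wants to try the constant decomposition outside a WebNN backend, here is a minimal runnable sketch on nested Python lists. The helpers `shape_of`, `full`, and `concat` are hypothetical stand-ins for the tensor ops (`full` plays the role of expand on a scalar). One assumption differs from the pseudocode: each chunk is sized to match the current result on every other axis, which is what a strict concat would require.

```python
def shape_of(tensor):
    """Shape of a rectangular nested list."""
    shape = []
    while isinstance(tensor, list):
        shape.append(len(tensor))
        if not tensor:
            break
        tensor = tensor[0]
    return shape


def full(shape, value):
    """Nested list of the given shape filled with value (stand-in for expand)."""
    if not shape:
        return value
    return [full(shape[1:], value) for _ in range(shape[0])]


def concat(tensors, axis):
    """Concatenate nested lists along axis."""
    if axis == 0:
        return [item for tensor in tensors for item in tensor]
    return [concat(list(parts), axis - 1) for parts in zip(*tensors)]


def pad_constant(tensor, beginning_padding, ending_padding, value=0):
    result = tensor
    # Reverse axis order, as suggested in the note above.
    for axis in reversed(range(len(shape_of(tensor)))):
        low, high = beginning_padding[axis], ending_padding[axis]
        if low == 0 and high == 0:
            continue
        shape = shape_of(result)
        low_chunk = full(shape[:axis] + [low] + shape[axis + 1:], value)
        high_chunk = full(shape[:axis] + [high] + shape[axis + 1:], value)
        result = concat([low_chunk, result, high_chunk], axis)
    return result


print(pad_constant([[1, 2], [3, 4]], [1, 0], [0, 1], value=9))
# [[9, 9, 9], [1, 2, 9], [3, 4, 9]]
```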

edge

Take a slice of the very edges, expand that slice on both sides, then concatenate the fragments along each dimension.

result = input
for each axis in input tensor rank
    // If padding present for current dimension on either the low or high end.
    if beginningPadding[axis] != 0 || endingPadding[axis] != 0
        dimension = input.dimensions[axis]

        lowChunkStarts      = projectToRank(0, input.rank, axis)
        lowChunkEnds        = projectToRank(1, input.rank, axis)
        highChunkStarts     = projectToRank(dimension - 1, input.rank, axis)
        highChunkEnds       = projectToRank(dimension, input.rank, axis)
        lowChunkDimensions  = projectToRank(beginningPadding[axis], input.rank, axis)
        highChunkDimensions = projectToRank(endingPadding[axis], input.rank, axis)

        lowSlice  = slice(result, starts=lowChunkStarts, ends=lowChunkEnds)
        highSlice = slice(result, starts=highChunkStarts, ends=highChunkEnds)
        lowChunk  = expand(lowSlice, lowChunkDimensions)
        highChunk = expand(highSlice, highChunkDimensions)
        result    = concat({lowChunk, result, highChunk}, axis)
    endif
endfor
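
The edge decomposition can be sketched the same way on nested Python lists. The helpers `shape_of`, `slice_axis`, and `concat` are hypothetical stand-ins for the tensor ops, and repeating the one-element-thick slice in the concat list stands in for expand. Note that the padding amount is not tied to the input size here.

```python
def shape_of(tensor):
    """Shape of a rectangular nested list."""
    shape = []
    while isinstance(tensor, list):
        shape.append(len(tensor))
        if not tensor:
            break
        tensor = tensor[0]
    return shape


def slice_axis(tensor, axis, start, end):
    """Slice [start, end) along one axis, keeping every other axis whole."""
    if axis == 0:
        return tensor[start:end]
    return [slice_axis(sub, axis - 1, start, end) for sub in tensor]


def concat(tensors, axis):
    """Concatenate nested lists along axis."""
    if axis == 0:
        return [item for tensor in tensors for item in tensor]
    return [concat(list(parts), axis - 1) for parts in zip(*tensors)]


def pad_edge(tensor, beginning_padding, ending_padding):
    result = tensor
    for axis in range(len(shape_of(tensor))):
        low, high = beginning_padding[axis], ending_padding[axis]
        if low == 0 and high == 0:
            continue
        size = shape_of(result)[axis]
        low_slice = slice_axis(result, axis, 0, 1)
        high_slice = slice_axis(result, axis, size - 1, size)
        # Repeating the slices low/high times plays the role of expand.
        result = concat([low_slice] * low + [result] + [high_slice] * high, axis)
    return result


print(pad_edge([[1, 2], [3, 4]], [1, 1], [1, 2]))
# [[1, 1, 2, 2, 2], [1, 1, 2, 2, 2], [3, 3, 4, 4, 4], [3, 3, 4, 4, 4]]
```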

symmetric

Tile a mirrored chunk, then slice the result.

mirroredChunk = input
for each axis in input tensor rank
    if beginningPadding[axis] != 0 || endingPadding[axis] != 0
        mirroredChunk = concat({mirroredChunk, reverse(mirroredChunk, axis)}, axis)
    endif
endfor
repetitions  = compute based on ceil(outputDimensions / mirroredChunk size)
outputStarts = compute based on mirroredChunk size, outputDimensions, and beginningPadding
outputEnds   = compute based on mirroredChunk size, outputDimensions, and endingPadding
result       = slice(tile(mirroredChunk, repetitions), outputStarts, outputEnds)
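
To make the "compute based on" steps concrete, here is a 1D sketch using plain Python lists. The function name `pad_symmetric_1d` and the particular start/repetition arithmetic are one possible choice, not the only one; it assumes a non-empty input.

```python
import math


def pad_symmetric_1d(values, beginning_padding, ending_padding):
    # One full period: the input followed by its mirror image.
    mirrored_chunk = values + values[::-1]
    period = len(mirrored_chunk)
    output_length = beginning_padding + len(values) + ending_padding
    # Start far enough into the tiled data that beginning_padding elements
    # precede a copy of the original input.
    output_start = math.ceil(beginning_padding / period) * period - beginning_padding
    repetitions = math.ceil((output_start + output_length) / period)
    tiled = mirrored_chunk * repetitions
    return tiled[output_start:output_start + output_length]


print(pad_symmetric_1d([3, 4, 5], 2, 2))
# [4, 3, 3, 4, 5, 5, 4]
```

For higher ranks, the same start and repetition values would be computed independently per axis.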

reflection

Same as symmetric, except that you slice off the very edges before mirroring the inner chunk to tile. So [3,4,5] mirrored becomes [3,4,5,4], and then tiled becomes [3,4,5,4,3,4,5,4,3...].
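
A 1D sketch of the reflection variant, using plain Python lists. Here modular indexing into the period stands in for tile followed by slice (the two give the same elements); the function name `pad_reflection_1d` is hypothetical, and a non-empty input is assumed.

```python
def pad_reflection_1d(values, beginning_padding, ending_padding):
    # Mirror only the interior so the edge values are not repeated:
    # [3, 4, 5] becomes the period [3, 4, 5, 4].
    period = values + values[-2:0:-1]
    output_range = range(-beginning_padding, len(values) + ending_padding)
    # Python's % always returns a non-negative index for a positive modulus.
    return [period[i % len(period)] for i in output_range]


print(pad_reflection_1d([3, 4, 5], 2, 2))
# [5, 4, 3, 4, 5, 4, 3]
```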
