xmos / ai_tools

AI applications and tools

Modify Conv2D Depthwise kernel weights layout #207

Closed keithm-xmos closed 3 years ago

keithm-xmos commented 4 years ago

This will allow one output channel group of weights to be pre-fetched into RAM at a time. Saves ~30 KB on MobileNetV1.

astewart-xmos commented 4 years ago

Currently the kernel weights for conv2d_depthwise() have the shape (K_h, K_w, X_c), where the dimensions correspond to the kernel height, kernel width and input channel count respectively. The layout is the standard C layout for int8_t weights[K_h][K_w][X_c].

The trouble with this layout is that the subset of the tensor corresponding to a single output channel group is not contiguous in memory. Rather, the weights needed appear in 16-byte chunks separated by X_c - 16 bytes (because for depthwise Y_c = X_c). So our options are either to pull the entire tensor into RAM to do our computations, or to pull K_h*K_w separate 16-byte chunks into RAM to slice the tensor.

The former obviously has a RAM usage issue. The latter should actually work, in that the slice copied is a valid depthwise weight tensor for 16 input/output channels. But the problem is that with the current implementation of conv2d_depthwise(), it would have to assume there are only 16 output channels, and thus the output tensor would have to have the shape (((X_c+15)//16), Y_h, Y_w, 16).
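
To make the stride pattern concrete, here is a small illustrative helper (the name and code are mine, not from lib_nn) that computes where the 16-byte chunk for output channel group g at kernel position (kh, kw) starts in the standard (K_h, K_w, X_c) layout; consecutive kernel positions are X_c bytes apart, so the chunks a single group needs are separated by X_c - 16 bytes:

#include <stddef.h>

// Illustrative only (not library code): byte offset, within the standard
// (K_h, K_w, X_c) weight layout, of the 16-byte chunk that output channel
// group `g` needs at kernel position (kh, kw).
size_t depthwise_group_chunk_offset(unsigned kh, unsigned kw,
                                    unsigned K_w, unsigned X_c,
                                    unsigned g)
{
    // weights[kh][kw][c] lives at byte (kh*K_w + kw)*X_c + c;
    // group g starts at channel c = 16*g
    return ((size_t)kh * K_w + kw) * X_c + (size_t)g * 16;
}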

There are a couple of possible solutions here.

Solution 1: Change the layout of the weight tensor so that it has the shape (((X_c+15)//16), K_h, K_w, 16)

This has the disadvantage that the weights must be 'boggled' from their standard layout before they can be used with conv2d_depthwise(). It has the advantage that copying over a portion of the weight tensor is just a matter of copying one contiguous block of memory.
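
As a rough sketch of what that re-layout ('boggling') might look like, assuming X_c is a multiple of 16 (the function below is illustrative, not code from the repository):

#include <stdint.h>
#include <string.h>

// Sketch only (not library code): rearrange depthwise weights from the
// standard (K_h, K_w, X_c) layout into the grouped (X_c/16, K_h, K_w, 16)
// layout of Solution 1, so each output channel group becomes one contiguous
// K_h*K_w*16-byte block. Assumes X_c is a multiple of 16 for brevity.
void boggle_depthwise_weights(int8_t* dst, const int8_t* src,
                              unsigned K_h, unsigned K_w, unsigned X_c)
{
    const unsigned n_groups = X_c / 16;
    for(unsigned g = 0; g < n_groups; g++)
        for(unsigned kh = 0; kh < K_h; kh++)
            for(unsigned kw = 0; kw < K_w; kw++)
                memcpy(&dst[((g * K_h + kh) * K_w + kw) * 16],
                       &src[(kh * K_w + kw) * X_c + g * 16],
                       16);
}

With this layout, the K_h*K_w*16 bytes for output channel group g start at byte offset g*K_h*K_w*16 and can be fetched with a single contiguous read.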

Solution 2: Change conv2d_depthwise() (and the functions it calls) so that the kernel tensor step is independent of input channel count.

This has the advantage that we don't actually have to use a different layout for the kernel tensor. The disadvantage is that multiple blocks of memory must be copied to create the matrix slice. But as far as the kernel functions are concerned, this should be a reasonably easy change.

astewart-xmos commented 4 years ago

Spoke to @keithm-xmos about this.

Decided we'll try solution 2 for now.

Keith did register some concern that doing many sparse reads from flash may be very slow compared to doing a single contiguous read. That's something we should be aware of when verifying this change. If it's too much slower we may have to fall back to solution 1.


With this solution, the code calling conv2d_depthwise() will have to copy a slice of the weights tensor (as well as the BSO tensor):

#include <assert.h>
#include <stdint.h>
#include <string.h>

// Copy the weights for `channel_count` channels, starting at `start_channel`,
// out of a depthwise weight tensor with shape (K_h, K_w, X_c) into a
// contiguous destination buffer of K_h * K_w * channel_count bytes.
void memcopy_depthwise_subtensor(
    int8_t* dest,
    const int8_t* weights,
    const unsigned K_h,
    const unsigned K_w,
    const unsigned X_c,
    const unsigned start_channel,
    const unsigned channel_count)
{
    assert(start_channel % 16 == 0);
    assert(channel_count % 4 == 0);

    weights = &(weights[start_channel]); // Address of weights[0][0][start_channel]

    // K_h * K_w blocks in total, for a total of K_h * K_w * channel_count bytes
    for(unsigned k = 0; k < K_h * K_w; k++){
        memcpy(dest, weights, channel_count);
        dest = &(dest[channel_count]);
        weights = &(weights[X_c]);
    }
}
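
For example (the wrapper below and its names are illustrative, not from the codebase), a caller could walk the output channel groups of a 3x3 depthwise layer with 40 channels, pulling each group's slice into a scratch buffer in turn:

#include <stdint.h>

// Illustrative usage (names are mine): for a 3x3 depthwise layer with 40
// channels, pull each output channel group's slice of the weight tensor
// into a small scratch buffer in turn.
enum { K_H = 3, K_W = 3, X_C = 40 };

void process_by_group(const int8_t full_weights[K_H][K_W][X_C])
{
    static int8_t weights_scratch[K_H * K_W * 16];  // one output channel group

    for(unsigned start = 0; start < X_C; start += 16){
        unsigned count = (X_C - start < 16) ? (X_C - start) : 16;
        memcopy_depthwise_subtensor(weights_scratch, &full_weights[0][0][0],
                                    K_H, K_W, X_C,
                                    start, count);
        // ... run the job covering output channels [start, start+count)
        //     against weights_scratch (see the job overrides discussed below) ...
    }
}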

Additionally, the calling code will need to override a parameter in the initialized job which otherwise causes the kernel to assume the whole weight tensor is in memory. Not sure what that parameter is yet.

astewart-xmos commented 4 years ago

I think I have this working now.

The parameter to be overridden in the integration code is nn_conv2d_depthwise_job_t.stride.k_channels.

So, suppose we have a depthwise layer with 40 output channels. If only one output group's coefficients are being loaded into RAM at a time, the job which computes output channels 16-31 should get the following overrides after being initialized.

    job->stride.start.K = 0;   // This allows you to just give the start address of the SRAM buffer for the weights
    job->stride.start.BSO = 0; // This allows you to just give the start address of the SRAM buffer for the BSO tensor
    job->stride.k_channels = 16; // One output group in weight tensor
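
Putting the pieces together for that example, a rough integration sketch might look as follows. Everything except nn_conv2d_depthwise_job_t, the three overridden fields and memcopy_depthwise_subtensor() is an assumption on my part, including the scratch buffer names, the nn_bso_block_t type name and the one-BSO-block-per-16-channels layout:

#include <stdint.h>
#include <string.h>
// plus the lib_nn header that defines nn_conv2d_depthwise_job_t

// Rough sketch, not library code: prepare the job that covers output
// channels 16..31 of the 40-channel example so that it reads its weights
// and its BSO block from small SRAM scratch buffers.
void prepare_group_1(nn_conv2d_depthwise_job_t* job,
                     int8_t* weights_scratch,         // K_h*K_w*16 bytes, SRAM
                     nn_bso_block_t* bso_scratch,     // one BSO block, SRAM
                     const int8_t* full_weights,      // (K_h, K_w, X_c), in flash
                     const nn_bso_block_t* full_bso,  // one block per 16 channels
                     unsigned K_h, unsigned K_w, unsigned X_c)
{
    // Copy this group's slice of the weight tensor (helper defined above).
    memcopy_depthwise_subtensor(weights_scratch, full_weights,
                                K_h, K_w, X_c,
                                /* start_channel */ 16,
                                /* channel_count */ 16);

    // Copy this group's BSO block.
    memcpy(bso_scratch, &full_bso[1], sizeof(nn_bso_block_t));

    // Point the job at the start of the scratch buffers, and make its kernel
    // tensor stride behave as if the weight tensor had only 16 channels.
    job->stride.start.K = 0;
    job->stride.start.BSO = 0;
    job->stride.k_channels = 16;
}
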
keithm-xmos commented 3 years ago

Closed by https://github.com/xmos/ai_tools/pull/245