LSTM and ConvLSTM shapes

I spent some time figuring out the exact shape of each step in LSTM and ConvLSTM; below is the resulting note.

I find "know thy shapes" to be a very effective middle ground for understanding model architectures: it sits between only knowing the basic idea ("LSTM = a cell state controlled by 3 gates") and implementing everything in Python with no frameworks.
First some notations:
h_t: hidden state (i.e. the output of LSTM cell) at timestep t
f_t: forget gate
i_t: input gate
q_t: output gate
X_t: input
S_t: cell state
kernel: weight matrix to apply on input
recurrent_kernel: weight matrix to apply on recurrent hidden state.
And notations for operations:
x: matrix multiplication
*: element-wise multiplication, e.g. [1,2] * [3,4] = [3,8], also known as the Hadamard product.
o: convolution operation. An input of shape (x, y, z) convolved with a kernel of shape (a, b) and n filters results in shape (x-a+1, y-b+1, n), assuming 1x1 stride and no padding. Note that in ConvLSTM2D there is no convolution over z.
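Both conventions can be sanity-checked in NumPy (a quick sketch; the second part only verifies the shape arithmetic, not an actual convolution):

```python
import numpy as np

# Hadamard product: element-wise multiplication of same-shape arrays
assert (np.array([1, 2]) * np.array([3, 4])).tolist() == [3, 8]

# 'valid' convolution shape rule: (x, y, z) o (a, b) kernel with n filters -> (x-a+1, y-b+1, n)
x, y, z = 128, 128, 3
a, b, n = 5, 5, 8
out_shape = (x - a + 1, y - b + 1, n)
assert out_shape == (124, 124, 8)
```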
Below is a summary table for quick reference, assuming 1x1 stride and no padding for ConvLSTM2D:

                        LSTM                 ConvLSTM2D
  kernel                (units, input_dim)   (conv_kernel_x, conv_kernel_y, channel, filters)
  recurrent_kernel      (units, units)       (conv_kernel_x, conv_kernel_y, filters, filters)
  bias                  (units, 1)           (filters, 1)
  gates f_t, i_t, q_t   (units, 1)           (width - conv_kernel_x + 1, height - conv_kernel_y + 1, filters)
  cell state S_t        (units, 1)           (width - conv_kernel_x + 1, height - conv_kernel_y + 1, filters)
  hidden state h_t      (units, 1)           (width - conv_kernel_x + 1, height - conv_kernel_y + 1, filters)

LSTM

LSTMs (or RNNs in general) are for sequence learning problems. LSTMs are robust against vanishing/exploding gradients by design.

TensorFlow r1.12 implementation

Note this is "1D LSTM", i.e. each element of the input sequence must have exactly one dimension, for example a one-hot encoding.
Input
Input shape has 3 dimensions: (batch, timestep, input_dim)
batch is the number of sequences per training step; timestep is the length of each sequence; input_dim is the length of each element in the sequence, e.g. a word embedding vector.
For example, if we feed a sentence-prediction model 8 sentences at a time, each sentence has 100 characters, each character is turned into a word embedding vector of size 20, then the input shape would be (8, 100, 20).
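The example above as a NumPy shape check (zeros stand in for real embeddings):

```python
import numpy as np

# 8 sentences per batch, 100 characters each, each character embedded into a vector of size 20
batch, timestep, input_dim = 8, 100, 20
X = np.zeros((batch, timestep, input_dim))
assert X.shape == (8, 100, 20)
```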
Model
We can specify one model parameter: units, the length of the output hidden state (so h_t has shape (units, 1)).
The internal calculations are as follows:
forget gate:
f_t = | (units, 1)
recurrent_activation( | element-wise activation, typically sigmoid
kernel x X_t + | (units, input_dim) x (input_dim, 1) -> (units, 1)
recurrent_kernel x h_{t-1} + | (units, units) x (units, 1) -> (units, 1)
bias | (units, 1)
)
The same formula applies for the input gate and output gate, each with its own kernel, recurrent_kernel, and bias.

Cell state:

S_t = | (units, 1)
    f_t * S_{t-1} + | (units, 1) * (units, 1) -> (units, 1)
    i_t * activation( | element-wise activation, typically tanh
        kernel x X_t + recurrent_kernel x h_{t-1} + bias | a candidate cell state, computed like a gate with its own weights -> (units, 1)
    )

Hidden state (aka output of LSTM cell):

h_t = q_t * activation(S_t) | (units, 1) * (units, 1) -> (units, 1)

In summary, shapes of internal weights are:

kernel: (units, input_dim)
recurrent_kernel: (units, units)
bias: (units, 1)
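A minimal NumPy sketch of one full LSTM timestep, just to confirm the shapes above. units=4 and input_dim=20 are arbitrary; weights are random; sigmoid is used for recurrent_activation and tanh for activation, mirroring the common Keras defaults:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

units, input_dim = 4, 20
rng = np.random.default_rng(0)

def gate_weights():
    # per-gate weights: kernel (units, input_dim), recurrent_kernel (units, units), bias (units, 1)
    return (rng.standard_normal((units, input_dim)),
            rng.standard_normal((units, units)),
            rng.standard_normal((units, 1)))

X_t = rng.standard_normal((input_dim, 1))  # one element of the input sequence
h_prev = np.zeros((units, 1))              # h_{t-1}
S_prev = np.zeros((units, 1))              # S_{t-1}

def gate(weights, activation=sigmoid):
    kernel, recurrent_kernel, bias = weights
    return activation(kernel @ X_t + recurrent_kernel @ h_prev + bias)

f_t = gate(gate_weights())                 # forget gate, (units, 1)
i_t = gate(gate_weights())                 # input gate, (units, 1)
q_t = gate(gate_weights())                 # output gate, (units, 1)
candidate = gate(gate_weights(), np.tanh)  # candidate cell state, (units, 1)

S_t = f_t * S_prev + i_t * candidate       # cell state; element-wise ops keep (units, 1)
h_t = q_t * np.tanh(S_t)                   # hidden state, (units, 1)

assert S_t.shape == (units, 1) and h_t.shape == (units, 1)
```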
ConvLSTM2D

ConvLSTM is for spatiotemporal sequence learning problems. The key insight is that in LSTMs, the input-to-state and state-to-state transitions are fully-connected operations (matrix multiplications), which can be replaced by convolutions to extract spatial features more efficiently.

TensorFlow r1.12 implementation

Input

Input shape has 5 dimensions: (batch, timestep, width, height, channel).
The difference from LSTM is that ConvLSTM expands the single input_dim into 3 dimensions: width, height, and channel.
For example, if we feed a video frame prediction model 8 sets of frames at a time, each set contains 100 frames, each frame is of dimension 128x128x3, then the input shape would be (8, 100, 128, 128, 3).
Model
We can specify 2 model parameters: the number of filters, and the 2D kernel size (referred to as conv_kernel below).
You can also specify stride and padding; here we assume a 1x1 stride and no padding.
forget gate:
f_t = | (width - conv_kernel_x + 1, height - conv_kernel_y + 1, filters)
recurrent_activation(
input o kernel + | (width, height, channel) o (conv_kernel_x, conv_kernel_y, channel, filters) -> (width - conv_kernel_x + 1, height - conv_kernel_y + 1, filters)
h_{t-1} o recurrent_kernel + | (width - conv_kernel_x + 1, height - conv_kernel_y + 1, filters) o (conv_kernel_x, conv_kernel_y, filters, filters) -> (width - conv_kernel_x + 1, height - conv_kernel_y + 1, filters)
bias | (filters, 1), broadcast to each element within a filter
)
Note the convolution in h_{t-1} o recurrent_kernel is hardcoded to use 1x1 stride and padding = 'SAME' (see the TensorFlow code for recurrent_conv), so this convolution does not change the spatial shape of the hidden state.
The same formula applies for the input gate and output gate, each with its own kernel, recurrent_kernel, and bias.

Cell state and hidden state mirror the LSTM case, with matrix multiplication replaced by convolution; f_t, i_t, q_t, S_t, and h_t all have shape (width - conv_kernel_x + 1, height - conv_kernel_y + 1, filters):

S_t = f_t * S_{t-1} + i_t * activation(X_t o kernel + h_{t-1} o recurrent_kernel + bias)
h_t = q_t * activation(S_t)

In summary, shapes of internal weights are:

kernel: (conv_kernel_x, conv_kernel_y, channel, filters)
recurrent_kernel: (conv_kernel_x, conv_kernel_y, filters, filters)
bias: (filters, 1)
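The shape arithmetic for the input-to-state convolution can be sketched as a small helper (the 3x3 kernel and 8 filters below are arbitrary example values, applied to the 128x128x3 frames from earlier):

```python
def conv_output_shape(width, height, conv_kernel, filters):
    # 'valid' padding, 1x1 stride: each spatial dimension shrinks by (kernel size - 1)
    kx, ky = conv_kernel
    return (width - kx + 1, height - ky + 1, filters)

# input-to-state: (128, 128, 3) o (3, 3, 3, 8) -> (126, 126, 8)
gate_shape = conv_output_shape(128, 128, (3, 3), 8)
assert gate_shape == (126, 126, 8)
```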