
Shapes of LSTM and ConvLSTM


I spent some time figuring out the exact shape of each step in LSTM and ConvLSTM, and below is the resulting note.

I find "know thy shapes" to be a very effective middleground for understanding model architectures, between only know the basic idea ("LSTM = a state cell controlled by 3 gates") and implement everything in Python with no frameworks.

First, some notation:

- batch: number of sequences per training step
- timestep: length of each sequence
- input_dim: dimension of each element of an LSTM input sequence
- units: dimension of the LSTM hidden state (the LSTM model parameter)
- width, height, channel: dimensions of each frame in a ConvLSTM2D input sequence
- filters, (conv_kernel_x, conv_kernel_y): the ConvLSTM2D model parameters, i.e. the number of convolution filters and the 2D kernel size

And notation for operations:

- $\cdot$ : matrix multiplication
- $*$ : convolution
- $\circ$ : Hadamard (element-wise) product
- $\sigma$ : the sigmoid function

Below is a summary table for quick reference, assuming a 1x1 stride and no padding for ConvLSTM2D:

| | LSTM | ConvLSTM2D |
|---|---|---|
| model parameters | units | filters, 2D convolution kernel (conv_kernel_x, conv_kernel_y) |
| input | (batch, timestep, input_dim) | (batch, timestep, width, height, channel) |
| gates | (units, 1) | (width - conv_kernel_x + 1, height - conv_kernel_y + 1, filters) |
| kernel | (units, input_dim) | (conv_kernel_x, conv_kernel_y, channel, filters) |
| recurrent_kernel | (units, units) | (conv_kernel_x, conv_kernel_y, filters, filters) |
| cell_state | (units, 1) | (width - conv_kernel_x + 1, height - conv_kernel_y + 1, filters) |
| hidden_state / output (without batch) | (units, 1) | (width - conv_kernel_x + 1, height - conv_kernel_y + 1, filters) |

LSTM

LSTMs (or RNNs in general) are for sequence learning problems. By design, LSTMs are much more robust against vanishing/exploding gradients than plain RNNs.

TensorFlow r1.12 implementation

Note this is a "1D LSTM", i.e. each element of the input sequence must be a vector with a single dimension, for example a one-hot encoding.

Input

Input shape has 3 dimensions: (batch, timestep, input_dim)

batch is the number of sequences per training step; timestep is the length of each sequence; input_dim is the length of each element in the sequence, e.g. a word embedding vector.

For example, if we feed a sentence-prediction model 8 sentences at a time, where each sentence has 100 characters and each character is turned into an embedding vector of size 20, then the input shape would be (8, 100, 20).
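
As a quick sanity check, we can feed an input of exactly this shape through a Keras LSTM layer (a minimal sketch; units=32 is an arbitrary illustrative choice):

```python
import numpy as np
import tensorflow as tf

# 8 sequences per batch, 100 timesteps each, input_dim = 20
x = np.zeros((8, 100, 20), dtype=np.float32)

lstm = tf.keras.layers.LSTM(units=32)  # units=32 is illustrative
h = lstm(x)
print(h.shape)  # (8, 32): one hidden state of size `units` per sequence
```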

Model

We can specify one model parameter: units, which is the dimension of the output hidden state.

The internal calculations are as follows:
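
(These are the standard LSTM equations, written with a per-gate kernel $W$, recurrent kernel $U$, and bias $b$, using the notation defined above.)

$$
\begin{aligned}
i_t &= \sigma(W_i \cdot x_t + U_i \cdot h_{t-1} + b_i) \\
f_t &= \sigma(W_f \cdot x_t + U_f \cdot h_{t-1} + b_f) \\
o_t &= \sigma(W_o \cdot x_t + U_o \cdot h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c \cdot x_t + U_c \cdot h_{t-1} + b_c) \\
c_t &= f_t \circ c_{t-1} + i_t \circ \tilde{c}_t \\
h_t &= o_t \circ \tanh(c_t)
\end{aligned}
$$

Each $W$ multiplies $x_t$ of shape (input_dim, 1) and each $U$ multiplies $h_{t-1}$ of shape (units, 1), so every gate, the cell state $c_t$, and the hidden state $h_t$ come out with shape (units, 1), matching the summary table.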

In summary, the shapes of the internal weights are:

- kernel (per gate): (units, input_dim)
- recurrent_kernel (per gate): (units, units)
- gates, cell state, and hidden state: (units, 1)
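
Note that Keras stores the four per-gate kernels concatenated along the last axis, and as the transpose of the per-gate shapes above, so inspecting the layer shows a factor of 4 (a minimal sketch):

```python
import numpy as np
import tensorflow as tf

lstm = tf.keras.layers.LSTM(units=32)
lstm(np.zeros((8, 100, 20), dtype=np.float32))  # run once to build the weights

for w in lstm.weights:
    print(w.name, w.shape)
# kernel:           (20, 128) = (input_dim, 4 * units)
# recurrent_kernel: (32, 128) = (units, 4 * units)
# bias:             (128,)    = (4 * units,)
```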

ConvLSTM2D

ConvLSTM is for spatial-temporal sequence learning problems. The key insight is that in LSTMs, the input-to-state and state-to-state transitions are fully-connected operations (matrix multiplications), which can be replaced by convolutions to extract spatial features more efficiently.
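
Schematically, the gate computations keep the same form as in LSTM, with convolution in place of matrix multiplication (the original ConvLSTM paper also adds Hadamard peephole terms, omitted here); for example, the input gate becomes:

$$
i_t = \sigma(W_i * X_t + U_i * H_{t-1} + b_i)
$$

where $*$ is convolution and the input $X_t$, hidden state $H_{t-1}$, and gate $i_t$ are now 3-D tensors (their exact shapes are in the summary table above).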

TensorFlow r1.12 implementation

Input

Input shape has 5 dimensions: (batch, timestep, width, height, channel).

The difference from LSTM is that ConvLSTM expands the single input_dim into 3 dimensions: width, height, and channel.

For example, if we feed a video-frame-prediction model 8 sets of frames at a time, where each set contains 100 frames and each frame has dimensions 128x128x3, then the input shape would be (8, 100, 128, 128, 3).
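
Again as a sanity check (a minimal sketch; filters=16 and the 3x3 kernel are arbitrary illustrative choices):

```python
import numpy as np
import tensorflow as tf

# 8 sets of frames, 100 frames each, 128x128 RGB
x = np.zeros((8, 100, 128, 128, 3), dtype=np.float32)

# Keras defaults: 1x1 stride, 'valid' padding (i.e. no padding)
conv_lstm = tf.keras.layers.ConvLSTM2D(filters=16, kernel_size=(3, 3))
h = conv_lstm(x)
print(h.shape)  # (8, 126, 126, 16) = (batch, width - 3 + 1, height - 3 + 1, filters)
```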

Model

We can specify 2 model parameters: the number of filters, and the 2D kernel size (referred to as (conv_kernel_x, conv_kernel_y) below).

You can also specify the stride and padding; here we assume a 1x1 stride and no padding.

In summary, the shapes of the internal weights are:

- kernel (per gate): (conv_kernel_x, conv_kernel_y, channel, filters)
- recurrent_kernel (per gate): (conv_kernel_x, conv_kernel_y, filters, filters)
- gates, cell state, and hidden state: (width - conv_kernel_x + 1, height - conv_kernel_y + 1, filters)
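
As with LSTM, Keras stores the four gate kernels concatenated, so the last axis of each stored kernel carries a factor of 4 (a minimal sketch; the smaller 32x32 frames are just to keep it cheap to run):

```python
import numpy as np
import tensorflow as tf

conv_lstm = tf.keras.layers.ConvLSTM2D(filters=16, kernel_size=(3, 3))
conv_lstm(np.zeros((1, 10, 32, 32, 3), dtype=np.float32))  # run once to build the weights

for w in conv_lstm.weights:
    print(w.name, w.shape)
# kernel:           (3, 3, 3, 64)  = (conv_kernel_x, conv_kernel_y, channel, 4 * filters)
# recurrent_kernel: (3, 3, 16, 64) = (conv_kernel_x, conv_kernel_y, filters, 4 * filters)
# bias:             (64,)          = (4 * filters,)
```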