LSTM and ConvLSTM shapes

I spent some time figuring out the exact shape of each step in LSTM and ConvLSTM; below is the resulting note.

I find "know thy shapes" to be a very effective middle ground for understanding model architectures: it sits between only knowing the basic idea ("LSTM = a cell state controlled by 3 gates") and implementing everything in Python with no frameworks.
First some notations:
h_t: hidden state (i.e. the output of LSTM cell) at timestep t
f_t: forget gate
i_t: input gate
q_t: output gate
X_t: input
S_t: cell state
kernel: weight matrix to apply on input
recurrent_kernel: weight matrix to apply on recurrent hidden state.
And notations for operations:
x: matrix multiplication
*: element-wise multiplication, e.g. [1,2] * [3,4] = [3,8], also known as the Hadamard product.
o: convolution operation. An input of shape (x, y, z) convolved with a kernel of shape (a, b) and n filters results in shape (x-a+1, y-b+1, n), assuming 1x1 stride and no padding. Note that in ConvLSTM2D there is no convolution over z.
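Both conventions can be sanity-checked in NumPy (a quick sketch; the second part only verifies the shape arithmetic, not an actual convolution):

```python
import numpy as np

# Hadamard product: element-wise multiplication of same-shape arrays
assert (np.array([1, 2]) * np.array([3, 4])).tolist() == [3, 8]

# 'valid' convolution shape rule: (x, y, z) o (a, b) kernel with n filters -> (x-a+1, y-b+1, n)
x, y, z = 128, 128, 3
a, b, n = 5, 5, 8
out_shape = (x - a + 1, y - b + 1, n)
assert out_shape == (124, 124, 8)
```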
Below is a summary table for quick reference, assuming 1x1 stride and no padding for ConvLSTM2D:

                        LSTM                 ConvLSTM2D
  kernel                (units, input_dim)   (conv_kernel_x, conv_kernel_y, channel, filters)
  recurrent_kernel      (units, units)       (conv_kernel_x, conv_kernel_y, filters, filters)
  bias                  (units, 1)           (filters, 1)
  gates f_t, i_t, q_t   (units, 1)           (width - conv_kernel_x + 1, height - conv_kernel_y + 1, filters)
  cell state S_t        (units, 1)           (width - conv_kernel_x + 1, height - conv_kernel_y + 1, filters)
  hidden state h_t      (units, 1)           (width - conv_kernel_x + 1, height - conv_kernel_y + 1, filters)

LSTM

LSTMs (or RNNs in general) are for sequence learning problems. LSTMs are robust against vanishing/exploding gradients by design.

TensorFlow r1.12 implementation

Note this is "1D LSTM", i.e. each element of the input sequence must have exactly one dimension, for example a one-hot encoding.
Input
Input shape has 3 dimensions: (batch, timestep, input_dim)
batch is the number of sequences per training step; timestep is the length of each sequence; input_dim is the length of each element in the sequence, e.g. a word embedding vector.
For example, if we feed a sentence-prediction model 8 sentences at a time, each sentence has 100 characters, each character is turned into a word embedding vector of size 20, then the input shape would be (8, 100, 20).
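The example above as a NumPy shape check (zeros stand in for real embeddings):

```python
import numpy as np

# 8 sentences per batch, 100 characters each, each character embedded into a vector of size 20
batch, timestep, input_dim = 8, 100, 20
X = np.zeros((batch, timestep, input_dim))
assert X.shape == (8, 100, 20)
```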
Model
We can specify one model parameter: units, the length of the output hidden state (so h_t has shape (units, 1)).
The internal calculations are as follows:
forget gate:
f_t = | (units, 1)
recurrent_activation( | element-wise activation, typically sigmoid
kernel x X_t + | (units, input_dim) x (input_dim, 1) -> (units, 1)
recurrent_kernel x h_{t-1} + | (units, units) x (units, 1) -> (units, 1)
bias | (units, 1)
)
The same formula applies for the input gate and output gate, each with its own kernel, recurrent_kernel, and bias.

Cell state:

S_t = | (units, 1)
    f_t * S_{t-1} + | (units, 1) * (units, 1) -> (units, 1)
    i_t * activation( | element-wise activation, typically tanh
        kernel x X_t + recurrent_kernel x h_{t-1} + bias | a candidate cell state, computed like a gate with its own weights -> (units, 1)
    )

Hidden state (aka output of LSTM cell):

h_t = q_t * activation(S_t) | (units, 1) * (units, 1) -> (units, 1)

In summary, shapes of internal weights are:

kernel: (units, input_dim)
recurrent_kernel: (units, units)
bias: (units, 1)
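A minimal NumPy sketch of one full LSTM timestep, just to confirm the shapes above. units=4 and input_dim=20 are arbitrary; weights are random; sigmoid is used for recurrent_activation and tanh for activation, mirroring the common Keras defaults:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

units, input_dim = 4, 20
rng = np.random.default_rng(0)

def gate_weights():
    # per-gate weights: kernel (units, input_dim), recurrent_kernel (units, units), bias (units, 1)
    return (rng.standard_normal((units, input_dim)),
            rng.standard_normal((units, units)),
            rng.standard_normal((units, 1)))

X_t = rng.standard_normal((input_dim, 1))  # one element of the input sequence
h_prev = np.zeros((units, 1))              # h_{t-1}
S_prev = np.zeros((units, 1))              # S_{t-1}

def gate(weights, activation=sigmoid):
    kernel, recurrent_kernel, bias = weights
    return activation(kernel @ X_t + recurrent_kernel @ h_prev + bias)

f_t = gate(gate_weights())                 # forget gate, (units, 1)
i_t = gate(gate_weights())                 # input gate, (units, 1)
q_t = gate(gate_weights())                 # output gate, (units, 1)
candidate = gate(gate_weights(), np.tanh)  # candidate cell state, (units, 1)

S_t = f_t * S_prev + i_t * candidate       # cell state; element-wise ops keep (units, 1)
h_t = q_t * np.tanh(S_t)                   # hidden state, (units, 1)

assert S_t.shape == (units, 1) and h_t.shape == (units, 1)
```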
ConvLSTM2D

ConvLSTM is for spatiotemporal sequence learning problems. The key insight is that in LSTMs, the input-to-state and state-to-state transitions are fully-connected operations (matrix multiplications), which can be replaced by convolutions to extract spatial features more efficiently.

TensorFlow r1.12 implementation

Input

Input shape has 5 dimensions: (batch, timestep, width, height, channel).
The difference from LSTM is that ConvLSTM expands the single input_dim into 3 dimensions: width, height, and channel.
For example, if we feed a video frame prediction model 8 sets of frames at a time, each set contains 100 frames, each frame is of dimension 128x128x3, then the input shape would be (8, 100, 128, 128, 3).
Model
We can specify 2 model parameters: the number of filters, and the 2D kernel size (referred to as conv_kernel below).
You can also specify stride and padding; here we assume a 1x1 stride and no padding.
forget gate:
f_t = | (width - conv_kernel_x + 1, height - conv_kernel_y + 1, filters)
recurrent_activation(
input o kernel + | (width, height, channel) o (conv_kernel_x, conv_kernel_y, channel, filters) -> (width - conv_kernel_x + 1, height - conv_kernel_y + 1, filters)
h_{t-1} o recurrent_kernel + | (width - conv_kernel_x + 1, height - conv_kernel_y + 1, filters) o (conv_kernel_x, conv_kernel_y, filters, filters) -> (width - conv_kernel_x + 1, height - conv_kernel_y + 1, filters)
bias | (filters, 1), broadcast to each element within a filter
)
Note the convolution in h_{t-1} o recurrent_kernel is hardcoded to use 1x1 stride and padding = 'SAME' (see the TensorFlow code for recurrent_conv), so this convolution does not change the spatial shape of the hidden state.
The same formula applies for the input gate and output gate, each with its own kernel, recurrent_kernel, and bias.

Cell state and hidden state mirror the LSTM case, with matrix multiplication replaced by convolution; f_t, i_t, q_t, S_t, and h_t all have shape (width - conv_kernel_x + 1, height - conv_kernel_y + 1, filters):

S_t = f_t * S_{t-1} + i_t * activation(X_t o kernel + h_{t-1} o recurrent_kernel + bias)
h_t = q_t * activation(S_t)

In summary, shapes of internal weights are:

kernel: (conv_kernel_x, conv_kernel_y, channel, filters)
recurrent_kernel: (conv_kernel_x, conv_kernel_y, filters, filters)
bias: (filters, 1)
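The shape arithmetic for the input-to-state convolution can be sketched as a small helper (the 3x3 kernel and 8 filters below are arbitrary example values, applied to the 128x128x3 frames from earlier):

```python
def conv_output_shape(width, height, conv_kernel, filters):
    # 'valid' padding, 1x1 stride: each spatial dimension shrinks by (kernel size - 1)
    kx, ky = conv_kernel
    return (width - kx + 1, height - ky + 1, filters)

# input-to-state: (128, 128, 3) o (3, 3, 3, 8) -> (126, 126, 8)
gate_shape = conv_output_shape(128, 128, (3, 3), 8)
assert gate_shape == (126, 126, 8)
```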