Key idea is using async gradient updates from parallel agents to replace experience replay as a way to de-correlate experience. This enables on-policy algorithms, e.g. SARSA and actor-critic.
Experience replay only works for off-policy algorithms like Q-learning, because the Q update uses actions generated by an older version of the network.
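A minimal sketch of the idea (not the paper's code): several parallel workers, each with its own copy of the environment, apply asynchronous updates to shared parameters. Because the workers are in different parts of the state space at any moment, the updates are decorrelated without a replay buffer, so an on-policy method like one-step SARSA works. The toy chain environment, tabular Q, and hyperparameters here are all made up for illustration.

```python
import threading
import random

import numpy as np

N_STATES, N_ACTIONS = 10, 2          # toy chain MDP: actions = move left / move right
Q = np.zeros((N_STATES, N_ACTIONS))  # shared parameters (tabular stand-in for a network)

def step(s, a):
    """Toy environment: reward 1 for reaching the right end of the chain."""
    s2 = max(0, min(N_STATES - 1, s + (1 if a == 1 else -1)))
    done = (s2 == N_STATES - 1)
    return s2, (1.0 if done else 0.0), done

def eps_greedy(s, eps):
    if random.random() < eps:
        return random.randrange(N_ACTIONS)
    return int(np.argmax(Q[s]))

def worker(eps, episodes=200, alpha=0.1, gamma=0.99):
    for _ in range(episodes):
        s, a, done = 0, eps_greedy(0, eps), False
        while not done:
            s2, r, done = step(s, a)
            a2 = eps_greedy(s2, eps)           # on-policy: next action from the same policy
            target = r + (0.0 if done else gamma * Q[s2, a2])
            Q[s, a] += alpha * (target - Q[s, a])   # asynchronous, lock-free update
            s, a = s2, a2

# each worker gets a different exploration rate, as in the paper's per-thread exploration
threads = [threading.Thread(target=worker, args=(eps,)) for eps in (0.5, 0.3, 0.1, 0.05)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(Q)
```

(With CPython's GIL the threads don't truly run in parallel, so this is only an illustration of the asynchrony, not of the speedup.)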
Algorithm details for A3C:
Thread-level parallelisation instead of process-level (well... I think we could use queues on a single machine with multiple Python processes, because the GIL really sucks)
Shared weights for the policy and value networks, with different output layers: softmax vs. linear (see the sketch after this list)
Entropy regularisation of the policy (to discourage premature convergence to a deterministic policy)
Each agent has a different exploration strategy (a different sigma)
Optimisation algorithm: RMSProp with the statistics g (the moving average of squared gradients) shared across threads; a sketch is below
LSTM layer at the end (in one variant of the model)
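A sketch of the architecture notes above, assuming PyTorch (the paper doesn't prescribe a framework, and the LSTM variant is omitted here): one shared trunk, a softmax policy head and a linear value head, with an entropy bonus added to the loss. `ActorCritic`, `a3c_loss`, and `entropy_beta` are my own hypothetical names.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActorCritic(nn.Module):
    """Shared trunk; policy head ends in softmax, value head is linear."""
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, n_actions)  # softmax output
        self.value_head = nn.Linear(hidden, 1)           # linear output

    def forward(self, obs):
        h = self.trunk(obs)
        return F.softmax(self.policy_head(h), dim=-1), self.value_head(h).squeeze(-1)

def a3c_loss(model, obs, actions, returns, entropy_beta=0.01):
    probs, values = model(obs)
    log_probs = torch.log(probs + 1e-8)
    advantage = returns - values
    # policy gradient term, with the advantage treated as a constant
    policy_loss = -(log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
                    * advantage.detach()).mean()
    value_loss = advantage.pow(2).mean()
    # entropy regularisation: subtracting it from the loss pushes the policy to stay stochastic
    entropy = -(probs * log_probs).sum(dim=-1).mean()
    return policy_loss + 0.5 * value_loss - entropy_beta * entropy
```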
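And a sketch of the "RMSProp with shared g" point: the second-moment estimate g is a single shared array that every worker thread reads and writes (rather than each thread keeping its own optimizer state). The class name and hyperparameter defaults are illustrative, not from the paper.

```python
import numpy as np

class SharedRMSProp:
    """RMSProp where the moving average g of squared gradients is shared
    across all worker threads (one g per parameter set, no per-thread state)."""
    def __init__(self, shape, lr=7e-4, alpha=0.99, eps=1e-8):
        self.g = np.zeros(shape)   # shared second-moment estimate
        self.lr, self.alpha, self.eps = lr, alpha, eps

    def step(self, theta, grad):
        # g <- alpha * g + (1 - alpha) * grad^2, updated Hogwild-style without locks
        self.g = self.alpha * self.g + (1 - self.alpha) * grad ** 2
        theta -= self.lr * grad / np.sqrt(self.g + self.eps)
        return theta
```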
Showed that A3C (and async RL methods in general) is:
Robust: it can learn across a wide range of initializations and learning rates. That is a great graph.
Learning speed scales linearly with the number of parallel agents
Stabilizing effect for value-based algorithms (Q-learning and SARSA)