Key idea is using async gradient updates from parallel agents to replace experience replay as a way to de-correlate experience. This enables on-policy algorithms, e.g. SARSA and actor-critic.
Experience replay only works for off-policy algorithms like Q-learning, because the Q update uses actions generated by an older version of the network.
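A minimal sketch of the idea (not the paper's code): several parallel workers, each with its own copy of the environment, apply asynchronous updates to shared parameters. Because the workers are in different parts of the state space at any moment, the updates are decorrelated without a replay buffer, so an on-policy method like one-step SARSA works. The toy chain environment, tabular Q, and hyperparameters here are all made up for illustration.

```python
import threading
import random

import numpy as np

N_STATES, N_ACTIONS = 10, 2          # toy chain MDP: actions = move left / move right
Q = np.zeros((N_STATES, N_ACTIONS))  # shared parameters (tabular stand-in for a network)

def step(s, a):
    """Toy environment: reward 1 for reaching the right end of the chain."""
    s2 = max(0, min(N_STATES - 1, s + (1 if a == 1 else -1)))
    done = (s2 == N_STATES - 1)
    return s2, (1.0 if done else 0.0), done

def eps_greedy(s, eps):
    if random.random() < eps:
        return random.randrange(N_ACTIONS)
    return int(np.argmax(Q[s]))

def worker(eps, episodes=200, alpha=0.1, gamma=0.99):
    for _ in range(episodes):
        s, a, done = 0, eps_greedy(0, eps), False
        while not done:
            s2, r, done = step(s, a)
            a2 = eps_greedy(s2, eps)           # on-policy: next action from the same policy
            target = r + (0.0 if done else gamma * Q[s2, a2])
            Q[s, a] += alpha * (target - Q[s, a])   # asynchronous, lock-free update
            s, a = s2, a2

# each worker gets a different exploration rate, as in the paper's per-thread exploration
threads = [threading.Thread(target=worker, args=(eps,)) for eps in (0.5, 0.3, 0.1, 0.05)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(Q)
```

(With CPython's GIL the threads don't truly run in parallel, so this is only an illustration of the asynchrony, not of the speedup.)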
Algorithm details for A3C:
Thread-level parallelisation instead of process-level (well... I think we could use queues on a single machine with multiple Python processes, because the GIL really sucks)
Shared weights for the policy and value networks, with different output layers: softmax vs. linear (see the sketch after this list)
Entropy regularisation of the policy (to discourage premature convergence to a deterministic policy)
Each agent has a different exploration strategy (a different sigma)
Optimisation algorithm: RMSProp with the statistics g (the moving average of squared gradients) shared across threads; a sketch is below
LSTM layer at the end (in one variant of the model)
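A sketch of the architecture notes above, assuming PyTorch (the paper doesn't prescribe a framework, and the LSTM variant is omitted here): one shared trunk, a softmax policy head and a linear value head, with an entropy bonus added to the loss. `ActorCritic`, `a3c_loss`, and `entropy_beta` are my own hypothetical names.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActorCritic(nn.Module):
    """Shared trunk; policy head ends in softmax, value head is linear."""
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, n_actions)  # softmax output
        self.value_head = nn.Linear(hidden, 1)           # linear output

    def forward(self, obs):
        h = self.trunk(obs)
        return F.softmax(self.policy_head(h), dim=-1), self.value_head(h).squeeze(-1)

def a3c_loss(model, obs, actions, returns, entropy_beta=0.01):
    probs, values = model(obs)
    log_probs = torch.log(probs + 1e-8)
    advantage = returns - values
    # policy gradient term, with the advantage treated as a constant
    policy_loss = -(log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
                    * advantage.detach()).mean()
    value_loss = advantage.pow(2).mean()
    # entropy regularisation: subtracting it from the loss pushes the policy to stay stochastic
    entropy = -(probs * log_probs).sum(dim=-1).mean()
    return policy_loss + 0.5 * value_loss - entropy_beta * entropy
```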
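And a sketch of the "RMSProp with shared g" point: the second-moment estimate g is a single shared array that every worker thread reads and writes (rather than each thread keeping its own optimizer state). The class name and hyperparameter defaults are illustrative, not from the paper.

```python
import numpy as np

class SharedRMSProp:
    """RMSProp where the moving average g of squared gradients is shared
    across all worker threads (one g per parameter set, no per-thread state)."""
    def __init__(self, shape, lr=7e-4, alpha=0.99, eps=1e-8):
        self.g = np.zeros(shape)   # shared second-moment estimate
        self.lr, self.alpha, self.eps = lr, alpha, eps

    def step(self, theta, grad):
        # g <- alpha * g + (1 - alpha) * grad^2, updated Hogwild-style without locks
        self.g = self.alpha * self.g + (1 - self.alpha) * grad ** 2
        theta -= self.lr * grad / np.sqrt(self.g + self.eps)
        return theta
```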
Showed that A3C (and async RL methods in general) is:
Robust: it can learn across a wide range of initializations and learning rates. That is a great graph.
Learning speed scales linearly with the number of parallel agents
Stabilizing effect for value-based algorithms (Q-learning and SARSA)