First, a supervised learning (SL) policy network trained on expert moves to predict human moves
A fast rollout policy trained similarly, but with fewer features and a smaller network
Then a reinforcement learning (RL) policy, initialized from the SL policy and trained by self-play policy gradient to win games
Then a value network trained on the self-play dataset to predict the winner from a position
Then MCTS: from the root, select actions by value plus an exploration bonus from the SL policy prior; expand leaves with the SL policy; leaf value = weighted mean of the value-network estimate and an MC estimate from the fast rollout policy; the actual move played is the root child with the most visits (see the sketch below)
Search can run asynchronously on CPUs while network predictions run on GPUs
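A minimal sketch of the search loop these notes describe, with assumed constants LAMBDA and C_PUCT and hypothetical value_net / fast_rollout callables; it only illustrates the shape of the Q + u selection, the mixed leaf value, and the most-visited move choice, not the paper's implementation.

```python
import math

# Sketch of AlphaGo-style search: selection by Q + u, leaf value mixed from the
# value network and a fast-rollout outcome, final move = most-visited root child.
# `value_net` and `fast_rollout` are hypothetical stand-ins for the real networks.

LAMBDA = 0.5   # assumed mixing weight between value-net estimate and rollout outcome
C_PUCT = 5.0   # assumed exploration constant

class Node:
    def __init__(self, prior):
        self.prior = prior        # P(s, a): prior from the SL policy network
        self.visits = 0           # N(s, a)
        self.value_sum = 0.0      # W(s, a)
        self.children = {}        # action -> Node

    def q(self):
        return self.value_sum / self.visits if self.visits else 0.0

def select_child(node):
    """Pick the child maximizing Q + u, where u favors high-prior, rarely visited moves."""
    total_visits = sum(child.visits for child in node.children.values())
    def score(child):
        u = C_PUCT * child.prior * math.sqrt(total_visits) / (1 + child.visits)
        return child.q() + u
    return max(node.children.items(), key=lambda item: score(item[1]))

def leaf_value(state, value_net, fast_rollout):
    """Mixed leaf evaluation: (1 - lambda) * v(s) + lambda * z from a fast-policy playout."""
    v = value_net(state)          # value-network estimate
    z = fast_rollout(state)       # outcome of a playout with the fast rollout policy
    return (1 - LAMBDA) * v + LAMBDA * z

def chosen_move(root):
    """The move actually played is the root child with the most visits."""
    return max(root.children.items(), key=lambda item: item[1].visits)[0]
```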
Only black & white stone positions as input features (AlphaGo used additional hand-crafted features)
A single network for both the action distribution and the winning prediction (unlike AlphaGo's separate policy & value networks): (p, v) = f(s); p = vector of p(a|s); v = probability of winning from the current position; f = the network; s = the board position and history
Inference: simple tree search without MC rollouts; leaves are evaluated by the value head alone
Training: match (p, v) with (π, z), where π is the move probability from MCTS and z is the winner of the game the sample came from; see equation (1) of the paper for the loss function
The MCTS procedure is the same as in the AlphaGo paper; π is proportional to the exponentiated visit count of each move (sketched below)
Evaluated residual vs. convolutional and separate vs. dual network architectures; the residual dual-head network is best
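A minimal sketch of the (π, z) targets and the loss from the notes above, assuming a single sample and hypothetical network outputs p and v; π(a) ∝ N(a)^(1/τ) is built from MCTS visit counts and the loss follows equation (1), (z − v)² − πᵀ log p + c‖θ‖².

```python
import numpy as np

def pi_from_visits(visit_counts, temperature=1.0):
    """pi(a) ∝ N(a)^(1/tau): turn MCTS visit counts into the target move distribution."""
    counts = np.asarray(visit_counts, dtype=float) ** (1.0 / temperature)
    return counts / counts.sum()

def loss(p, v, pi, z, l2_norm_sq=0.0, c=1e-4):
    """Equation (1) for one sample: (z - v)^2 - pi^T log p + c * ||theta||^2."""
    value_loss = (z - v) ** 2
    policy_loss = -np.sum(pi * np.log(p + 1e-12))   # cross-entropy between MCTS pi and network p
    return value_loss + policy_loss + c * l2_norm_sq

# Toy example with 3 legal moves: the network outputs (p, v) for a position,
# MCTS provides the targets (pi, z) for the same position.
p = np.array([0.2, 0.5, 0.3])        # network move probabilities p(a|s)
v = 0.1                              # network estimate of the outcome from this position
pi = pi_from_visits([10, 80, 10])    # target distribution from MCTS visit counts
z = 1.0                              # final outcome of the game this sample came from
print(loss(p, v, pi, z))
```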
Yes, 0 papers in August. Shame :(
In September I read the 3 AlphaGo papers:
AlphaGo
AlphaGo Zero
AlphaZero