First, a supervised learning (SL) policy network trained on expert moves to predict human moves
A fast rollout policy trained similarly, but with fewer features and a smaller network
Then a reinforcement learning (RL) policy, initialized from the SL policy and trained by self-play policy gradient to win games
Then a value network trained on the self-play dataset to predict the winner from a position
Then MCTS: from the root, select actions by value plus an exploration bonus from the SL policy prior; expand leaves with the SL policy; leaf value = weighted mean of the value-network estimate and an MC estimate from the fast rollout policy; the actual move played is the root child with the most visits (see the sketch below)
Search can run asynchronously on CPUs while network predictions run on GPUs
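A minimal sketch of the search loop these notes describe, with assumed constants LAMBDA and C_PUCT and hypothetical value_net / fast_rollout callables; it only illustrates the shape of the Q + u selection, the mixed leaf value, and the most-visited move choice, not the paper's implementation.

```python
import math

# Sketch of AlphaGo-style search: selection by Q + u, leaf value mixed from the
# value network and a fast-rollout outcome, final move = most-visited root child.
# `value_net` and `fast_rollout` are hypothetical stand-ins for the real networks.

LAMBDA = 0.5   # assumed mixing weight between value-net estimate and rollout outcome
C_PUCT = 5.0   # assumed exploration constant

class Node:
    def __init__(self, prior):
        self.prior = prior        # P(s, a): prior from the SL policy network
        self.visits = 0           # N(s, a)
        self.value_sum = 0.0      # W(s, a)
        self.children = {}        # action -> Node

    def q(self):
        return self.value_sum / self.visits if self.visits else 0.0

def select_child(node):
    """Pick the child maximizing Q + u, where u favors high-prior, rarely visited moves."""
    total_visits = sum(child.visits for child in node.children.values())
    def score(child):
        u = C_PUCT * child.prior * math.sqrt(total_visits) / (1 + child.visits)
        return child.q() + u
    return max(node.children.items(), key=lambda item: score(item[1]))

def leaf_value(state, value_net, fast_rollout):
    """Mixed leaf evaluation: (1 - lambda) * v(s) + lambda * z from a fast-policy playout."""
    v = value_net(state)          # value-network estimate
    z = fast_rollout(state)       # outcome of a playout with the fast rollout policy
    return (1 - LAMBDA) * v + LAMBDA * z

def chosen_move(root):
    """The move actually played is the root child with the most visits."""
    return max(root.children.items(), key=lambda item: item[1].visits)[0]
```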
Only black & white stone positions as input features (AlphaGo used additional hand-crafted features)
A single network for both the action distribution and the winning prediction (unlike AlphaGo's separate policy & value networks): (p, v) = f(s); p = vector of p(a|s); v = probability of winning from the current position; f = the network; s = the board position and history
Inference: simple tree search without MC rollouts; leaves are evaluated by the value head alone
Training: match (p, v) with (π, z), where π is the move probability from MCTS and z is the winner of the game the sample came from; see equation (1) of the paper for the loss function
The MCTS procedure is the same as in the AlphaGo paper; π is proportional to the exponentiated visit count of each move (sketched below)
Evaluated residual vs. convolutional and separate vs. dual network architectures; the residual dual-head network is best
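A minimal sketch of the (π, z) targets and the loss from the notes above, assuming a single sample and hypothetical network outputs p and v; π(a) ∝ N(a)^(1/τ) is built from MCTS visit counts and the loss follows equation (1), (z − v)² − πᵀ log p + c‖θ‖².

```python
import numpy as np

def pi_from_visits(visit_counts, temperature=1.0):
    """pi(a) ∝ N(a)^(1/tau): turn MCTS visit counts into the target move distribution."""
    counts = np.asarray(visit_counts, dtype=float) ** (1.0 / temperature)
    return counts / counts.sum()

def loss(p, v, pi, z, l2_norm_sq=0.0, c=1e-4):
    """Equation (1) for one sample: (z - v)^2 - pi^T log p + c * ||theta||^2."""
    value_loss = (z - v) ** 2
    policy_loss = -np.sum(pi * np.log(p + 1e-12))   # cross-entropy between MCTS pi and network p
    return value_loss + policy_loss + c * l2_norm_sq

# Toy example with 3 legal moves: the network outputs (p, v) for a position,
# MCTS provides the targets (pi, z) for the same position.
p = np.array([0.2, 0.5, 0.3])        # network move probabilities p(a|s)
v = 0.1                              # network estimate of the outcome from this position
pi = pi_from_visits([10, 80, 10])    # target distribution from MCTS visit counts
z = 1.0                              # final outcome of the game this sample came from
print(loss(p, v, pi, z))
```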
Yes, 0 papers in August. Shame :(
In September I read the 3 AlphaGo papers:
AlphaGo
AlphaGo Zero
AlphaZero