werner-duvaud / muzero-general

MuZero
https://github.com/werner-duvaud/muzero-general/wiki/MuZero-Documentation
MIT License

Determining who is next to play inside the MCTS #19

Closed by fidel-schaposnik 4 years ago

fidel-schaposnik commented 4 years ago

In https://github.com/werner-duvaud/muzero-general/blob/283e3538485be0e36ef77f402249666f735f5278/self_play.py#L262 you essentially assume actions are taken by players in alternating order for two-player games. In a way, this is a rule ("Players take turns making moves") leaking into the MCTS, whereas we would like to assume we can only know who is next-to-play at the root of the tree where we can query the environment. Inside the tree, it would be more consistent to have the next-to-play be computed by the dynamics function, right?

This is also a limitation in the way actions are encoded: for example, my understanding is that castling in chess is encoded as two separate, consecutive moves made by the same player, but this would break the MCTS logic as it stands here. Any idea how this was handled by the original authors?

werner-duvaud commented 4 years ago

I understand your idea. I think it is feasible, but we would have to add an output to the prediction of the dynamics network.
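To make the idea concrete, here is a rough sketch of what an extra next-to-play output could look like. It is not code from this repository: `DynamicsOutput`, `to_play_logits` and `predicted_to_play` are hypothetical names, and the network itself is left out, so only the shape of the extra output is shown.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DynamicsOutput:
    """Hypothetical container for one dynamics-function prediction."""
    hidden_state: List[float]
    reward: float
    # Assumed extra head: one logit per player for "who moves next".
    to_play_logits: List[float]


def predicted_to_play(output: DynamicsOutput) -> int:
    """Return the index of the player with the highest next-to-play logit."""
    logits = output.to_play_logits
    return max(range(len(logits)), key=logits.__getitem__)
```

Inside the tree, each expanded node would then store `predicted_to_play(output)` instead of deriving the player from the depth. The extra head would also need its own training target, taken from the player actually to move in the recorded games.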

For the moment we are trying to follow what is done in the paper. In the pseudocode, the current player during the MCTS search is given directly by the game (the to_play() method of ActionHistory). This is the only external information the MCTS uses.

In our code, players can only move in strict alternation, because all the games implemented so far work this way, but this can be changed easily. The current approach keeps the MCTS class independent from the other classes.
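As an illustration of how the turn order could be swapped out without tying MCTS to a game class, here is a minimal sketch. It is not the repository's code: `ToPlayFn`, `alternating_to_play` and `make_repeat_turn_to_play` are hypothetical names, and the "repeat turn" rule is an invented example of a game where some actions let the same player move again.

```python
from typing import Callable, FrozenSet, Sequence

# Hypothetical signature: (player to move at the root, actions taken since the
# root) -> player to move at the resulting node.
ToPlayFn = Callable[[int, Sequence[int]], int]


def alternating_to_play(
    root_player: int, actions_from_root: Sequence[int], num_players: int = 2
) -> int:
    """Strict alternation: each action passes the turn to the next player."""
    return (root_player + len(actions_from_root)) % num_players


def make_repeat_turn_to_play(repeat_actions: FrozenSet[int]) -> ToPlayFn:
    """Two-player rule where actions in `repeat_actions` keep the turn."""

    def to_play(root_player: int, actions_from_root: Sequence[int]) -> int:
        player = root_player
        for action in actions_from_root:
            if action not in repeat_actions:
                player = 1 - player  # the turn passes to the other player
        return player

    return to_play
```

MCTS would receive one of these functions as an argument and call it when creating a child node, so the search code itself would stay free of game rules.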

About castling, it's encoded as a separate plane, so it is considered a single move. This is explained in the AlphaZero paper.

fidel-schaposnik commented 4 years ago

Great, thanks for the clarification on the encoding of castling!