werner-duvaud / muzero-general

MuZero
https://github.com/werner-duvaud/muzero-general/wiki/MuZero-Documentation
MIT License

Question: Why use negative board for observation in connect4 training? #8

Closed littleV closed 4 years ago

littleV commented 4 years ago

Hi,

I'm trying to write a Gomoku game with MuZero. I'm learning from Connect4 since it's also a two player game. However I noticed the following code:

def get_observation(self):
    if self.player == 1:
        return self.board
    else:
        return -self.board

Why do you return the negated board for one of the players instead of letting both players learn from the same board?

werner-duvaud commented 4 years ago

Hi,

It's a great project. We return the negated board depending on the player so that the board is always oriented from the current player's point of view. The board contains 0 for empty spaces, 1 for one player's pieces, and -1 for the other player's pieces. Multiplying by -1 therefore only changes the point of view.
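As a minimal sketch of that point-of-view flip, here is the same idea on a plain list-of-lists board. The standalone function signature and the use of -1 as the second player's id are assumptions for illustration, not the repository's code:

```python
def get_observation(board, player):
    """Return the board as seen by `player`.

    Assumes cells hold 0 (empty), 1 or -1, and that player 1 sees the
    board unchanged while the other player sees it negated.
    """
    if player == 1:
        return [row[:] for row in board]
    return [[-cell for cell in row] for row in board]


board = [
    [0, 0, 0],
    [0, -1, 0],
    [1, 1, -1],
]

view_first = get_observation(board, 1)
view_second = get_observation(board, -1)
```

In both views, the stones of the player about to move are 1 and the opponent's stones are -1, so the network only has to learn one side of the game.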

However, this mechanism is temporary; it is there to test the fully connected network. The shape of the board is better suited to a residual network, as in the AlphaZero and MuZero papers. We will commit one soon. Each player will then be encoded as a plane containing only that player's pieces.
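A rough sketch of what such a plane encoding could look like, assuming the same 0 / 1 / -1 board as above. The function name `encode_planes` and the plane order (current player first) are hypothetical choices, not necessarily what the repository does:

```python
def encode_planes(board, player):
    """Split a 0/1/-1 board into two binary planes.

    Plane 0 marks the stones of `player`, plane 1 marks the opponent's
    stones. `player` is assumed to be 1 or -1.
    """
    own = [[1 if cell == player else 0 for cell in row] for row in board]
    opponent = [[1 if cell == -player else 0 for cell in row] for row in board]
    return [own, opponent]


board = [
    [0, 0, 0],
    [0, -1, 0],
    [1, 1, -1],
]

planes_for_first = encode_planes(board, 1)
planes_for_second = encode_planes(board, -1)
```

Swapping the player swaps the two planes, which plays the same role as multiplying the single-matrix board by -1.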

littleV commented 4 years ago

@werner-duvaud

Thanks for the explanation. I'm new to machine learning, so I have a lot of questions about how the program works.

So do you suggest I do the same for Gomoku?

What's a good representation of the board? Using a matrix or an array? Can legal actions be a pair of integers (as in position on the board)? What is the use of action_space in the MuZero Config?

I see a continuous_self_play method in self_play.py. How is the self-play process stopped if it always runs in a while True loop?

werner-duvaud commented 4 years ago

@littleV

I think you are fairly free in how you encode the board. With the fully connected network, the representation is a flat list. With the residual network, the board is generally encoded as several planes (matrices). You can take inspiration from what has been done for AlphaZero.

Currently, all possible actions are numbered; that is the action_space in the config. MuZero returns one of these numbers to designate the action it wants to perform. Legal actions are a list of the numbers of the actions that are currently legal.
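For Gomoku, one way to number the actions is to flatten the board row by row. The sketch below assumes a 15x15 board and a 0 / 1 / -1 cell encoding; `BOARD_SIZE`, `position_to_action`, `action_to_position` and this standalone `legal_actions` are illustrative names, not part of the repository:

```python
BOARD_SIZE = 15  # assumed board size for this example

# Every intersection gets one integer id: 0 .. BOARD_SIZE * BOARD_SIZE - 1.
action_space = list(range(BOARD_SIZE * BOARD_SIZE))


def position_to_action(row, col):
    """Map a (row, col) position to its action number."""
    return row * BOARD_SIZE + col


def action_to_position(action):
    """Map an action number back to its (row, col) position."""
    return divmod(action, BOARD_SIZE)


def legal_actions(board):
    """Return the numbers of all empty intersections (cells equal to 0)."""
    return [
        position_to_action(row, col)
        for row in range(BOARD_SIZE)
        for col in range(BOARD_SIZE)
        if board[row][col] == 0
    ]


board = [[0] * BOARD_SIZE for _ in range(BOARD_SIZE)]
legal_before = legal_actions(board)
board[7][7] = 1  # a stone on the centre point
legal_after = legal_actions(board)
```

So a pair of integers can still be used at the game level, as long as it is converted to and from a single action number at the boundary with MuZero.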

continuous_self_play is an infinite loop that runs in a separate process, in which MuZero generates self-played games. The process is stopped by ray.shutdown() when training_steps is reached. This happens in the train method in muzero.py.
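The pattern can be illustrated without Ray. The sketch below is only an analogy built on the standard library: a daemon thread stands in for the Ray worker process, and leaving `train` stands in for ray.shutdown(). The replay buffer and the "training step" are placeholders:

```python
import threading
import time

replay_buffer = []  # stand-in for the shared replay buffer


def continuous_self_play():
    """Generate games forever; the loop itself has no exit condition."""
    game_id = 0
    while True:
        replay_buffer.append(f"game-{game_id}")  # placeholder for a real game
        game_id += 1
        time.sleep(0.001)


def train(training_steps):
    """Run the worker in the background and stop caring about it when done."""
    worker = threading.Thread(target=continuous_self_play, daemon=True)
    worker.start()

    step = 0
    while step < training_steps:
        if replay_buffer:  # wait until at least one game is available
            step += 1      # placeholder for one optimisation step
        time.sleep(0.001)

    # Nothing inside the worker ends the loop. A daemon thread is simply
    # discarded when the program exits, much as ray.shutdown() tears down
    # the worker processes from the outside.
    return step, len(replay_buffer)


steps_done, games_seen = train(training_steps=20)
```

The point is that the stopping condition lives in the training side, not in the self-play loop.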

littleV commented 4 years ago

@werner-duvaud

I was able to write the game and train a model. I trained for 10000 steps and around 700 games.

However, when I test the model in human play mode, it has only learned not to place a stone where I have already placed one. It always starts in the same position and plays the same sequence of moves as long as I don't place stones on those points. This means the training failed, since the model didn't learn how to respond to my moves.

Can you shed some light on how to tune the training? Or can I submit a PR for you to take a look? And eventually, if the game runs successfully, can the PR be merged?

werner-duvaud commented 4 years ago

@littleV

Great to see that it trains.

The number of self-played games is important. The hyperparameters also need to be adjusted; for instance, td_steps must be greater than the maximum number of moves needed to end a game. There are details on the hyperparameters in the paper and in the pseudocode. We will try to summarize them in the documentation. Also, for board games, a residual network will work better. We just released the ResNet.
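To see what td_steps changes, here is a simplified sketch of an n-step value target, loosely following the structure of the published MuZero pseudocode. It ignores the sign alternation between the two players, and the game data is made up for illustration:

```python
def n_step_value_target(rewards, root_values, index, td_steps, discount):
    """Value target for the position at `index`.

    rewards[i] is the reward received after move i; root_values[i] is the
    search value stored for position i. Player perspective is ignored.
    """
    bootstrap_index = index + td_steps
    if bootstrap_index < len(root_values):
        # The game is still running td_steps later: bootstrap from the
        # stored search value.
        value = root_values[bootstrap_index] * discount ** td_steps
    else:
        # td_steps reaches past the end of the game: no bootstrap term.
        value = 0.0
    for i, reward in enumerate(rewards[index:bootstrap_index]):
        value += reward * discount ** i
    return value


# A made-up 4-move game that ends with a reward of 1 on the last move.
rewards = [0, 0, 0, 1]
root_values = [0.5, 0.5, 0.5, 0.5]

short_target = n_step_value_target(rewards, root_values, 0, td_steps=2, discount=1.0)
long_target = n_step_value_target(rewards, root_values, 0, td_steps=5, discount=1.0)
```

With td_steps shorter than the game, the target for the first position is just the (possibly poor) search value two moves later. With td_steps longer than the game, it is the actual outcome, which is the only real signal in a board game where the reward arrives on the last move.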

By the way, it takes a lot of training: AlphaZero needed tens of millions of self-played games for chess. Gomoku will surely need far fewer, but it is difficult to estimate. We are in the process of testing a checkers game on a supercomputer.

If you have a well-trained version of Gomoku, it will be a pleasure to merge it.