werner-duvaud / muzero-general

MuZero
https://github.com/werner-duvaud/muzero-general/wiki/MuZero-Documentation
MIT License

Tracking for support of 2-player games with interrupts #54

Closed: mathstuf closed this issue 4 years ago

mathstuf commented 4 years ago

Per https://github.com/werner-duvaud/muzero-general/issues/26#issuecomment-617132864 and https://github.com/werner-duvaud/muzero-general/issues/19#issuecomment-597909095, the code only supports two-player games with strictly alternating actions. What would it take to encode the "next player" for the available actions into the game state? I'm thinking of games whose rules may require an action from a player out of turn order (e.g., in GIPF, making a line of 4 pieces compels the owner of those pieces to remove them from the board, regardless of whose turn the line(s) appeared on). For such a case, the action space could encode "this action compels removal from the board", but that breaks down when the opponent needs to make the decision before their turn; having the current player "decide" for them is…not ideal.

The above-mentioned issues were closed without any comment on why. Resolved? Discussion over? There are no links to the code, and I don't see anything in the game class that supports determining the next player.

werner-duvaud commented 4 years ago

Hi,

The problem is that in the MCTS search, MuZero plays in its head; that is, unlike AlphaZero, it does not have access to the real state of the game.

I see four options:

fidel-schaposnik commented 4 years ago

Edit: Didn't see the reply above in time, sorry for double posting!

The main constraint when dealing with more complicated turn orders is that MuZero can only query the environment about who is next to play at the root of the Monte Carlo search tree, not at internal nodes (where it only has access to a hidden state in some internal, learned representation). However, it needs this information to decide e.g. who gets the reward predicted by the dynamics head, so it has to be dealt with one way or another (this is also an issue for games with more than 2 players). As far as I can see, there are two main approaches to the problem:
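For concreteness, here is a minimal sketch (hypothetical names, not this repository's exact code) of where the hard-coded alternation lives inside the search: each node carries a to_play value so that predicted rewards can be credited to the right player, and with strictly alternating turns that value is simply flipped at every expansion.

```python
class Node:
    """One node of the Monte Carlo search tree (illustrative sketch)."""
    def __init__(self, prior, to_play):
        self.prior = prior
        self.to_play = to_play      # player who moves from this node
        self.children = {}

def expand(node, policy, num_players=2):
    for action, prior in policy.items():
        # Strict alternation is assumed here: this is exactly what breaks
        # when a rule (e.g. GIPF's compulsory line removal) lets a player
        # act twice in a row or out of turn.
        node.children[action] = Node(
            prior=prior,
            to_play=(node.to_play + 1) % num_players,
        )
```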

  1. Encoding the next player in the action space

For 2-player games, you can always assume players alternate moves: if a player can take two consecutive actions in some situation, you may enlarge the action space to extended_action_space_size = action_space_size + 1, adding one action that represents both of the original ones together (consolidating their rewards and state changes in your environment implementation). Repeating this process until it can no longer be applied, one ends up with an equivalent game that has a larger action space but alternating turns.
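As a hedged illustration (the Game API and action ids below are hypothetical, not this repository's), consider one situation where a move compels a forced follow-up, as in GIPF's line removal; a single extra action id then covers the pair:

```python
ACTION_SPACE_SIZE = 37                     # example value
COMBINED = ACTION_SPACE_SIZE               # the one new, consolidated action
EXTENDED_ACTION_SPACE_SIZE = ACTION_SPACE_SIZE + 1

TRIGGERING_MOVE = 12                       # example primitive action ids
FORCED_FOLLOWUP = 25

def step(game, action):
    """Apply an extended action; rewards of merged moves are summed."""
    if action == COMBINED:
        # Both primitive moves happen in one environment transition, so
        # the game MuZero observes still strictly alternates.
        reward = game.apply(TRIGGERING_MOVE)
        reward += game.apply(FORCED_FOLLOWUP)
        return reward
    return game.apply(action)
```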

More generally, for any number of players you may define extended_action_space_size = action_space_size * num_players, such that Action(i, j) in the extended action space represents "current player performs Action(i), next to play is Player(j)". As above, playing order can now be determined by the ActionHistory using just the sequence of actions taken so far, which is also available at the internal nodes of the search tree.
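A small sketch of this encoding (names are illustrative): pack (action, next player) into a single index, and recover whose turn it is from the last extended action alone.

```python
NUM_PLAYERS = 3                  # example values
ACTION_SPACE_SIZE = 9
EXTENDED_ACTION_SPACE_SIZE = ACTION_SPACE_SIZE * NUM_PLAYERS

def encode(i, j):
    """Extended action: 'current player performs Action(i), Player(j) is next'."""
    return i * NUM_PLAYERS + j

def decode(k):
    return k // NUM_PLAYERS, k % NUM_PLAYERS

def to_play(action_history, first_player=0):
    """Recover the playing order from the action sequence alone, which is
    available even at internal nodes of the search tree."""
    if not action_history:
        return first_player
    _, next_player = decode(action_history[-1])
    return next_player
```

Combinations that declare an illegal "next player" would simply be masked out of legal_actions by the environment.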

  2. Learning whose turn is next along with the other rules of the game

As mentioned in https://github.com/werner-duvaud/muzero-general/issues/19#issuecomment-597909095, an alternative approach is to add another head to the dynamics function, so that it also predicts which player is next to play after an action is performed. This could be a simple classifier with num_players categories, attached to the dynamics hidden state much like the policy head is attached in the prediction function. A categorical cross-entropy loss would then be added to the model's loss function, so that MuZero learns to classify hidden states according to who is next to play, based on the games observed in the replay buffer.
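A hedged PyTorch sketch of such a head (module and variable names are illustrative, not this repository's actual ones):

```python
import torch.nn as nn

class NextPlayerHead(nn.Module):
    """Classifier over num_players categories, attached to the dynamics
    hidden state much like the policy head in the prediction function."""
    def __init__(self, hidden_size, num_players):
        super().__init__()
        self.fc = nn.Linear(hidden_size, num_players)

    def forward(self, hidden_state):
        return self.fc(hidden_state)        # logits over players

# Extra training term: categorical cross-entropy between the predicted next
# player and the one actually observed in the replay buffer, e.g.
#   loss = value_loss + reward_loss + policy_loss \
#        + nn.functional.cross_entropy(next_player_logits, observed_next_player)
```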

Computationally this is not much different from (1), where one didn't add any new heads or losses but had to extend the policy head instead. However, it is conceptually nicer, since it means MuZero is learning the rules of the game from scratch, without having any of them artificially encoded in the action space.

Regardless of whether you choose (1) or (2), you also need to be more careful about how rewards and values are handled with more than two players, or with players who do not alternate turns, because simply swapping reward signs with every move is no longer sufficient. I think the cleanest way to approach this is to have all rewards and values be absolute (instead of relative to the player at move). The dynamics and prediction functions would then each produce num_players independent outputs, and the MCTS agent chooses which of these to consider according to who is next to play.
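A sketch of what that changes in the tree search (the node fields below are hypothetical): value statistics become per-player vectors, and selection reads the entry of whoever is to move, with no sign-flipping.

```python
import math

def puct_score(parent, child, c_puct=1.25):
    """child.value_sum is a per-player vector of *absolute* values, so the
    player to move at `parent` simply maximizes their own entry; this stays
    correct for >2 players and for out-of-turn moves alike."""
    mean_value = child.value_sum[parent.to_play] / max(1, child.visit_count)
    exploration = (c_puct * child.prior
                   * math.sqrt(parent.visit_count) / (1 + child.visit_count))
    return mean_value + exploration
```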

Note: While none of this is particularly innovative, I think the conclusion was that the code in this repository should follow the original paper as closely as possible, so it hasn't been implemented here. I've tried these ideas in a different implementation, and they seem to work as expected. However, training MuZero is already hard enough, and adding these extra complications (very large action spaces, new heads and losses, etc.) increases training time and makes things quite unstable.

mathstuf commented 4 years ago

> Change the game so that it looks like a game where players play in turn, by adding steps where nothing happens; this will make it compatible with the fact that players playing in turn is hard-coded.

Hmm. This seems the simplest. An interrupt can be seen as a forced pass on the other player's turn. This is probably the way I'd go for now.
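In the Game-class style this repository uses, a hedged sketch of that forced pass (the helper method names are hypothetical):

```python
PASS = 0    # reserve one action id as the no-op / pass move

class GipfLikeGame:
    def legal_actions(self):
        if self.interrupt_pending_for_opponent():
            # The decision really belongs to the opponent: the player whose
            # turn it nominally is may only pass, and the interrupt is then
            # resolved on the opponent's turn.
            return [PASS]
        return self.normal_moves()

    def step(self, action):
        if action == PASS:
            return self.get_observation(), 0.0, False   # nothing happens
        return self.apply_move(action)
```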

> It is doable and I find it more in the spirit of MuZero, but we preferred to remain close to the paper on this point for now.

> I think the conclusion was that the code in this repository should follow the original paper as much as possible, so it hasn't been implemented here.

Hmm, so what's up with this section, then? :) https://github.com/werner-duvaud/muzero-general#further-improvements

Maybe a "rejected features" section? Though I have a hard time seeing this as "too different" versus (what I interpret as) "Support stochastic environments" since that sounds like hidden-information games like Dominion (simulating the actual card draw is arguably cheating). Though the 21 implementation here seems just just expect training to cover all possible draw cards by assuming an infinite supply of shuffled decks over the training period. I don't think that works if you want to simulate casino 21 (N shuffled decks) or home (1, maybe 2 decks) variants.

werner-duvaud commented 4 years ago

Well, adding a network for predicting the next player is definitely not a rejected feature. As I said, we are still thinking about adding a new branch with this feature, but we are not sure about its priority. It could be part of "Better integration with more than two player games" in the further improvements section. I'm going to open an issue when we take a serious look at the further improvement tasks, especially the "more than two players" and "not playing in turn" ones. (If someone already has specific ideas, we can start talking about them here or on the Discord server.)

The further improvements section is just ideas and is not exhaustive. Working on games with imperfect information and more than two players not playing in turn is quite difficult :)

mathstuf commented 4 years ago

OK, thanks for the details.

> Working on games with imperfect information and more than two players not playing in turn is quite difficult :)

The game I'm looking at is perfect information, but has the "interrupt" wrinkle. I'm happy this project exists, since I wanted to code up some of the abstract strategy games I like, to see how an AlphaZero-like AI would perform. This generalization is certainly a good boost to actually completing that :) Thanks.