werner-duvaud / muzero-general

MuZero
https://github.com/werner-duvaud/muzero-general/wiki/MuZero-Documentation
MIT License

Including reward in Node.value() #27

Closed: fidel-schaposnik closed this issue 4 years ago

fidel-schaposnik commented 4 years ago

I see that in https://github.com/werner-duvaud/muzero-general/blob/633c658840397b674900506ae88d654f30c47270/self_play.py#L321 you now include Node.reward when computing the UCB score (as opposed to the original pseudocode). Can you explain why this isn't extended to the other places where the value is used, e.g. when updating the normalization statistics (https://github.com/werner-duvaud/muzero-general/blob/633c658840397b674900506ae88d654f30c47270/self_play.py#L336) or storing statistics (https://github.com/werner-duvaud/muzero-general/blob/633c658840397b674900506ae88d654f30c47270/self_play.py#L436)?
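
For readers following along, here is a minimal sketch of the two value terms being contrasted. It is not the repository's code: the `Node` fields, the helper names `prior_score`, `ucb_value_only` and `ucb_with_reward`, and the constants are assumptions chosen for illustration, and normalization is left out on purpose so that only the effect of the reward is visible.

```python
import math
from dataclasses import dataclass


@dataclass
class Node:
    # Hypothetical minimal node; the real class in self_play.py has more fields.
    prior: float
    reward: float = 0.0
    value_sum: float = 0.0
    visit_count: int = 0

    def value(self) -> float:
        # Mean backed-up value; 0 for a node that was never visited.
        return self.value_sum / self.visit_count if self.visit_count else 0.0


def prior_score(parent: Node, child: Node,
                pb_c_base: float = 19652, pb_c_init: float = 1.25) -> float:
    # Exploration term; the constants are illustrative defaults.
    pb_c = math.log((parent.visit_count + pb_c_base + 1) / pb_c_base) + pb_c_init
    pb_c *= math.sqrt(parent.visit_count) / (child.visit_count + 1)
    return pb_c * child.prior


def ucb_value_only(parent: Node, child: Node) -> float:
    # Value term uses only the child's mean value.
    return prior_score(parent, child) + child.value()


def ucb_with_reward(parent: Node, child: Node, discount: float = 0.997) -> float:
    # Value term also counts the reward obtained by reaching the child.
    q = child.reward + discount * child.value() if child.visit_count > 0 else 0.0
    return prior_score(parent, child) + q
```

With two children that have the same prior and visit count, the ranking can flip depending on which variant is used: a child with a high immediate reward but a low value wins under `ucb_with_reward` and loses under `ucb_value_only`.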

ahainaut commented 4 years ago

Hi,

We decided to include the reward, as suggested by the second version of the pseudocode. However, we found that the value used in the UCB score was not between 0 and 1, as the original paper suggests it should be.
After further investigation, we found that it was preferable to normalize with the reward included. Therefore, the min and max values will be updated as you mentioned. Concerning the value in the store_search_statistics method, we don't think it needs to differ from the pseudocode.
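
To illustrate the normalization point, here is a small sketch under stated assumptions. The `MinMaxStats` class is a simplified stand-in modeled on the helper of the same name in the pseudocode, and the discount and the (reward, value) pairs are made-up numbers. It compares tracking min and max on the value alone with tracking them on reward + discount * value, then normalizes the reward-inclusive quantity with each.

```python
class MinMaxStats:
    """Tracks the min and max of the values seen so far (simplified stand-in)."""

    def __init__(self):
        self.minimum = float("inf")
        self.maximum = float("-inf")

    def update(self, value):
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    def normalize(self, value):
        # Only rescale once two distinct values have been observed.
        if self.maximum > self.minimum:
            return (value - self.minimum) / (self.maximum - self.minimum)
        return value


discount = 0.997  # assumed discount factor
nodes = [(0.0, 0.2), (1.0, 0.5), (2.0, 0.9)]  # hypothetical (reward, value) pairs

stats_value_only = MinMaxStats()   # bounds tracked on the value alone
stats_with_reward = MinMaxStats()  # bounds tracked on reward + discount * value
for reward, value in nodes:
    stats_value_only.update(value)
    stats_with_reward.update(reward + discount * value)

# The quantity that enters the UCB score once the reward is included.
q_values = [reward + discount * value for reward, value in nodes]

out_of_range = [stats_value_only.normalize(q) for q in q_values]
in_range = [stats_with_reward.normalize(q) for q in q_values]
```

In this toy setting `in_range` stays within [0, 1], while `out_of_range` does not, because its bounds were computed on a different quantity than the one being normalized. That is consistent with updating the min and max using the reward as well.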